We introduce a useful tool for analyzing boosting algorithms
called the "smooth margin function," a differentiable approximation
of the usual margin for boosting algorithms. We present two boosting
algorithms based on this smooth margin, "coordinate ascent boosting"
and "approximate coordinate ascent boosting," which are similar
to Freund and Schapire's AdaBoost algorithm and Breiman's arc-gv
algorithm. We give convergence rates to the maximum margin solution
for both of our algorithms and for arc-gv. We then study AdaBoost's
convergence properties using the smooth margin function.
We precisely bound the margin attained by AdaBoost when the edges
of the weak classifiers fall within a specified range. This shows that a
previous bound proved by Rätsch and Warmuth is exactly tight.
Furthermore, we use the smooth margin to capture explicit properties of
AdaBoost in cases where cyclic behavior occurs.

1. Introduction. Boosting algorithms, which construct a "strong" classi- 
fier using only a training set and a "weak" learning algorithm, are currently 
among the most popular and most successful algorithms for statistical learn- 
ing (see, e.g., Caruana and Niculescu-Mizil's recent empirical comparison of 
algorithms [3]). Freund and Schapire's AdaBoost algorithm [7] was the first
practical boosting algorithm. AdaBoost maintains a discrete distribution 
(set of weights) over the training examples, and selects a weak classifier 
via the weak learning algorithm at each iteration. Training examples that 
were misclassified by the weak classifier at the current iteration then receive 
higher weights at the following iteration. The end result is a final combined 
classifier, given by a thresholded linear combination of the weak classifiers. 
See [13, 27] for an introduction to boosting. 

Shortly after AdaBoost was introduced, it was observed that AdaBoost 
often does not seem to suffer from overfitting, in the sense that the test error 
does not go up even after a rather large number of iterations [1, 5, 14]. This 
lack of overfitting was later explained by Schapire et al. [28] in terms of the 
margin theory. The margin of a boosted classifier on a particular example 
is a number between -1 and +1 that can be interpreted as a measure of
the classifier's confidence on this particular example. Further, the minimum 
margin over all examples in the training set is often referred to simply as the 
margin of the training set, or simply the margin when clear from context. 
Briefly, the margin theory states that AdaBoost tends to increase the mar- 
gins of the training examples, and that this increase in the margins implies 
better generalization performance. 

A complete analysis of AdaBoost's margin is nontrivial. Until recently, it 
was an open question whether or not AdaBoost always achieves the max- 
imum possible margin. This question was settled (negatively) in [20, 22]; 
an example was presented in which AdaBoost's asymptotic margin was 
proved to be significantly below the maximum value. This example exhibited 
"cyclic" behavior, where AdaBoost's parameter values repeat periodically. 
So AdaBoost does not generally maximize the margin; furthermore, until 
the present work, the cyclic case was the only case for which AdaBoost's 
convergence was fully understood in the separable setting. When it cannot 
be proved that the parameters will eventually settle down into a cycle, Ad- 
aBoost's convergence properties are more difficult to analyze. Yet it seems 
essential to understand this convergence in order to study AdaBoost's gen- 
eralization capabilities. 

In this work, we introduce a new tool for analyzing AdaBoost and related 
algorithms. This tool is a differentiable approximation of the usual margin 
called the smooth margin function. We use it to provide the following main 
contributions. 

• We identify an important new setting for which AdaBoost's convergence
can be completely understood, called the case of bounded edges. A special
case of our proof shows that the margin bound of Rätsch and Warmuth [17]
is tight, closing what they allude to as a "gap in theory." This
special case answers the question of how far below maximal AdaBoost's
margin can be. Furthermore, this clarifies in sharp and precise terms the
asymptotic relationship between the "edges" achieved by the weak learning
algorithm and the asymptotic margin of AdaBoost.
• We derive two new algorithms similar to AdaBoost that are based directly 
on the smooth margin. Unlike AdaBoost, these algorithms provably con- 
verge to a maximum margin solution asymptotically; in addition, they 
possess a fast convergence rate to a maximum margin solution. Simi- 
lar convergence rates based on the smooth margin are then presented for 
Breiman's arc-gv algorithm [2] answering what had been posed as an open 
problem by Meir and Rätsch [13].

1.1. The case of bounded edges. There is a rich literature connecting 
AdaBoost and margins. The margin theory of Schapire et al. [28] (later 
tightened by Koltchinskii and Panchenko [10]) showed that the larger the 
margins on the training examples, the better an upper bound on the gener- 
alization error, suggesting that, all else being equal, the generalization error 
can be reduced by systematically increasing the margins on the training 
set. Furthermore, Schapire et al. showed that AdaBoost has a tendency to 
increase the margins on the training examples. Thus, though not entirely 
complete, their theory and experiments strongly supported the notion that 
margins are highly relevant to the behavior and generalization performance 
of AdaBoost. 

These bounds can be reformulated (in a slightly weaker form) in terms of 
the minimum margin; this was the focus of previous work by Breiman [2], 
Grove and Schuurmans [9] and Rätsch and Warmuth [17]. It is natural, given
such an analysis, to pursue algorithms that will attempt to maximize this 
minimum margin. Such algorithms included Breiman's arc-gv algorithm [2]
and Grove and Schuurmans' LP-AdaBoost algorithm [9]. However, in apparent
contradiction of the margins theory, Breiman's experiments indicated
that his algorithm achieved higher margins than AdaBoost, and yet per- 
formed worse on test data. Although this would seem to indicate serious 
trouble for the margins theory, recently, Reyzin and Schapire [18] revisited 
Breiman's experiments and were able to reconcile his results with the mar- 
gins explanation, noting that the weak classifiers found by arc-gv are more 
complex than those found by AdaBoost. When this complexity is controlled, 
arc-gv continues to achieve larger minimum margins, but AdaBoost achieves 
much higher margins overall (and generally better test performance). Years 
earlier, Grove and Schuurmans [9] observed the same phenomenon; highly 
controlled experiments showed that AdaBoost achieved smaller minimum 
margins, overall larger margins, and often better test performance than LP- 
AdaBoost. 

Taken together, these results indicate that there is a delicate and complex
balance between the performance of the weak learning algorithm, the margins,
the problem domain, the specific boosting algorithm being used, and
the test error. It is the goal of the current work to improve our understanding
of the intricate relationships between these various factors.

Fig. 1. Plot of $\Upsilon(r)$ versus $r$ (lower curve), along with the function
$f(r) = r$ (upper curve).

In considering these complex relationships, a piece of the puzzle may
be determined theoretically by understanding AdaBoost's convergence.
AdaBoost has been shown to achieve large margins, but not maximal margins.
To be precise, Schapire et al. [28] showed that AdaBoost achieves at least half
of the maximum margin, that is, if the maximum margin is $\rho > 0$, AdaBoost
will achieve a margin of at least $\rho/2$. This bound was tightened by Rätsch
and Warmuth [17] who showed that AdaBoost asymptotically achieves a
margin of at least $\Upsilon(\rho) > \rho/2$, where
$\Upsilon : (0, 1) \to (0, \infty)$ is the monotonically
increasing function shown in Figure 1, namely,

$$
\Upsilon(r) := \frac{-\ln(1 - r^2)}{\ln\bigl((1 + r)/(1 - r)\bigr)}. \tag{1.1}
$$

However, there is still a large gap between $\Upsilon(\rho)$ and the maximum
margin $\rho$.
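To get a feel for the size of this gap, the function in (1.1) can be evaluated numerically. The following is a minimal sketch using only the Python standard library; the function name `upsilon` is ours and is chosen only for illustration.

```python
import math

def upsilon(r):
    """Evaluate the function in (1.1) for an edge value r in (0, 1)."""
    if not 0.0 < r < 1.0:
        raise ValueError("r must lie strictly between 0 and 1")
    return -math.log(1.0 - r * r) / math.log((1.0 + r) / (1.0 - r))

# Compare the three quantities discussed in the text: r/2, the value of (1.1), and r.
for r in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"r={r:.1f}  r/2={r / 2:.4f}  upsilon={upsilon(r):.4f}")
```

On a grid of values the output stays between `r/2` and `r`, which matches the two curves described for Figure 1.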

Our contribution is from the other direction; we have just described
theoretical lower bounds for the margin, whereas we are now interested in upper
bounds. Previously, we showed that it is possible for AdaBoost to achieve a
margin that is significantly below the maximal value [22]. In this work, we
show that Rätsch and Warmuth's bound is actually tight. In other words,
we prove that it is possible for AdaBoost to achieve an asymptotic margin
arbitrarily close to $\Upsilon(\rho)$. More generally, our theorem regarding the
case of "bounded edges" says the following, where the "edge" measures the
performance of the weak learning algorithm at each iteration:

• If AdaBoost's edge values are within a range $[\bar\rho, \bar\rho + \sigma]$
for some $\bar\rho \ge \rho$, then AdaBoost's margin asymptotically lies within
the interval $[\Upsilon(\bar\rho), \Upsilon(\bar\rho + \sigma)]$.






Hence there is a fundamental connection between the performance of the 
weak learning algorithm and AdaBoost's asymptotic margin; if AdaBoost's 
edges fall within a given interval, we can find a corresponding interval for 
its asymptotic margin. 

Now, since we have proven that we can more or less predetermine the 
value of AdaBoost's margin simply by specifying the edge values, we can 
perform a new experiment. Since the studies of Breiman [2] and Grove and 
Schuurmans [9] suggest that the margin theory cannot be easily tested using 
multiple algorithms, we now perform a controlled study with only one algo- 
rithm. The experiment in Section 7.2 consists of many trials with the same 
algorithm (AdaBoost) achieving different values of the margin on the same 
dataset. We find that as the (predetermined) margin increases, the proba- 
bility of error on test data decreases dramatically. Our experiment supports 
the margin theory; in at least some cases, a larger margin does correlate 
with better generalization. 

1.2. Convergence properties of new and old algorithms. Since AdaBoost 
may achieve a margin as low as $\Upsilon(\rho)$, and since it has the idiosyncratic
(albeit fascinating and possibly helpful) tendency to sometimes get stuck in 
cyclic patterns [11, 22, 23], we are inspired to find algorithms that are similar 
to AdaBoost that have better convergence guarantees. We also study these 
cyclic patterns of AdaBoost as a special case for understanding its general 
convergence properties. 

Our first main focus is to analyze two algorithms designed to maximize the 
smooth margin, called coordinate ascent boosting and approximate coordi- 
nate ascent boosting (presented in our previous work [23] without analysis). 
Coordinate ascent/descent algorithms are optimization algorithms where a
step is made along only one coordinate at each iteration. The coordinate, 
which is also the choice of weak classifier, is determined by the weak learning 
algorithm. AdaBoost is also a coordinate descent algorithm [2, 6, 8, 12, 16], 
but its objective function need not be directly related to the margin or 
smooth margin; in fact, AdaBoost's objective converges to zero whenever 
the asymptotic margin is any positive value. 

There are other algorithms designed to maximize the margin, though not 
based on coordinate ascent/descent of a fixed objective function. Here is a
description of the known convergence properties of the relevant algorithms: 
AdaBoost does not converge to a maximum margin solution. Breiman's arc- 
gv algorithm [2, 13] has been proven to converge to the maximum margin 
asymptotically, but we are not aware of any proven convergence rate prior 
to this work. (Note that Meir and Rätsch [13] give a very simple asymptotic
convergence proof for a variant of arc-gv; however, they note that no
convergence rate can be derived from the proof.) Rätsch and Warmuth's
AdaBoost* algorithm [17] has a fast convergence rate, namely, it yields a
solution within $\nu$ of the maximum margin in $2(\log_2 m)/\nu^2$ steps, where
$m$ is the number of training examples. However, the "greediness" parameter
$\nu$ must be manually entered (and perhaps adjusted) by the user; the
algorithm is quite sensitive to $\nu$. If it is estimated slightly too large or
too small, the algorithm either takes a long time to converge, or it will not
achieve the desired precision. (E.g., the experiments in [17] show that the
algorithm performs well only for $\nu$ in a carefully chosen range. In [25],
$\nu$ was estimated slightly too small, and the algorithm did not converge in
a timely manner.) For any fixed value of $\nu$, asymptotic convergence is not
guaranteed and will generally not be achieved.

In contrast to previous algorithms, the ones we introduce have a proven 
fast convergence rate to the maximum margin, they have asymptotic con- 
vergence to the maximum margin, they do not require a choice of greediness 
parameter since the greediness is adaptively adjusted based on the progress 
of the algorithm, and they are based on coordinate ascent of a sensible objec- 
tive, namely the smooth margin. The convergence rates for our algorithms 
and for arc-gv are custom-designed using recursive equalities for the smooth 
margin; we know of no standard techniques that would allow us to obtain 
such tight rates. 

We also focus on the convergence properties of AdaBoost itself, using the 
smooth margin as a helpful analytical tool. The usefulness of the smooth 
margin follows largely from an important theorem, which shows that the 
value of the smooth margin increases if and only if AdaBoost takes a "large 
enough" step. Much previous work has focused on the statistical properties 
of AdaBoost indirectly through generalization bounds [10, 28], whereas our 
goal is to explore the way in which AdaBoost actually converges in order to 
produce a powerful classifier. 

In Section 7.1, we use the smooth margin function to prove general prop- 
erties of AdaBoost in cases where cyclic behavior occurs, extending previous 
work [22, 23]. "Cyclic behavior for AdaBoost" means that the weak learn- 
ing algorithm repeatedly chooses the same sequence of weak classifiers, and 
the weight vectors repeat with a given period. When the number of train- 
ing examples is small, it is likely that this behavior will be observed. Our 
first main result concerning cyclic AdaBoost is a proof that the value of 
the smooth margin must decrease an infinite number of times modulo one 
exception. Thus, a positive quality which holds for our new algorithms does 
not hold for AdaBoost: our new algorithms always increase the smooth mar- 
gin at every iteration, whereas cyclic AdaBoost usually cannot. The single 
exception is the case where all edge values are identical. Our second result 
in this section concerns this exceptional case. We show that if all edges in a 
cycle are identical, then all support vectors (examples nearest the decision 
boundary) are misclassified by the same number of weak classifiers during 
the cycle. Thus, in this exceptional case, a strong equivalence exists between 
support vectors; they are misclassified the same proportion of the time by
the weak learning algorithm. 

Here is the outline for the full paper. In Section 2, we introduce our no- 
tation and explain the AdaBoost algorithm. In Section 3, we describe the 
smooth margin function that our algorithms are based on. In Section 4, 
we describe coordinate ascent boosting (Algorithm 1) and approximate co- 
ordinate ascent boosting (Algorithm 2), and in Section 5, the convergence 
of these algorithms is discussed, along with the convergence of arc-gv in 
Section 6. In Section 7, we show connections between AdaBoost and our 
smooth margin function. Specifically, in Section 7.1, we focus on cyclic Ad- 
aBoost, and in Section 7.2, we discuss the case of bounded edges, including 
the experiment described earlier. Sections 8, 9 and 10 contain proofs from 
Sections 3, 5, 6 and 7. Preliminary and less detailed statements of these 
results appear in [25, 26]. 

2. Notation and introduction to AdaBoost. Our notation is similar to
that of Collins, Schapire and Singer [4]. The training set consists of examples
with labels $\{(x_i, y_i)\}_{i=1,\dots,m}$, $m > 1$, where
$(x_i, y_i) \in \mathcal{X} \times \{-1, 1\}$. The space $\mathcal{X}$ never
appears explicitly in our calculations. Let $\mathcal{H} = \{h_1, \dots, h_n\}$
be the set of all possible weak classifiers that can be produced by the weak
learning algorithm, where $h_j : \mathcal{X} \to \{-1, 1\}$. (The $h_j$'s are not
assumed to be linearly independent; it is even possible that both $h$ and $-h$
belong to $\mathcal{H}$.) Since our classifiers are binary, and since we restrict
our attention to their behavior on a finite training set, we can assume the
number of weak classifiers $n$ is finite. We typically think of $n$ as being
large, $m \ll n$, which makes a gradient descent calculation impractical; when
$n$ is not large, the linear program can be solved directly using an algorithm
such as LP-AdaBoost [9]. The classification rule that AdaBoost outputs is
$f_{\mathrm{Ada},\lambda}$, where $\mathrm{sign}(f_{\mathrm{Ada},\lambda})$
indicates the predicted class. The form of $f_{\mathrm{Ada},\lambda}$ is

$$
f_{\mathrm{Ada},\lambda} := \frac{\sum_{j=1}^{n} \lambda_j h_j}{\|\lambda\|_1},
$$

where $\lambda \in \mathbb{R}^n_+$ is the (unnormalized) coefficient vector. We
define the 1-norm $\|\lambda\|_1$ as usual:
$\|\lambda\|_1 := \sum_{j=1}^{n} |\lambda_j|$. At iteration $t$ of AdaBoost, the
coefficient vector is $\lambda_t$, and the sum is denoted
$s_t := \|\lambda_t\|_1$.

We define an $m \times n$ matrix $M$ where $M_{ij} = y_i h_j(x_i)$, that is,
$M_{ij} = +1$ if training example $i$ is classified correctly by weak classifier
$h_j$, and $-1$ otherwise. We assume that no column of $M$ has all $+1$'s, that
is, no weak classifier can classify all the training examples correctly.
(Otherwise the learning problem is trivial.) This notation is useful
mathematically for our analysis; however, it is not generally wise to
explicitly construct large $M$ in practice since the weak learning algorithm
provides the necessary column for each iteration. $M$ acts as the only "input"
to AdaBoost in this notation, containing all the necessary information about
the weak learning algorithm and training examples.

The margin theory developed via a set of generalization bounds that are
based on the margin distribution of the training examples [10, 28], where
the margin of training example $i$ with respect to classifier $\lambda$ is
defined to be $y_i f_{\mathrm{Ada},\lambda}(x_i)$, or equivalently,
$(M\lambda)_i / \|\lambda\|_1$. These bounds can be reformulated (in a slightly
weaker form) in terms of the minimum margin. We call the minimum margin over
the training examples the margin of the training set, denoted $\mu(\lambda)$,
that is,

$$
\mu(\lambda) := \min_i \frac{(M\lambda)_i}{\|\lambda\|_1}.
$$

Any training example $i$ whose margin is equal to the minimum margin
$\mu(\lambda)$ will be called a support vector. (There is a technical remark
about our definition of AdaBoost. At iteration $t$, the (unnormalized)
coefficient vector is denoted $\lambda_t$; i.e., the coefficient of weak
classifier $h_j$ determined by AdaBoost at iteration $t$ is $\lambda_{t,j}$. In
the next iteration, all but one of the entries of $\lambda_{t+1}$ are the same
as in $\lambda_t$; the only entry that is changed (for index $j = j_t$) is given
a positive increment in our description of AdaBoost, i.e.,
$\lambda_{t+1,j_t} > \lambda_{t,j_t}$. Starting from $\lambda_1 = 0$, this means
that all the $\lambda_t$ for $t > 1$ have nonnegative entries. We thus need to
study the effect of AdaBoost only on the positive cone
$\mathbb{R}^n_+ := \{\lambda \in \mathbb{R}^n : \forall j\ \lambda_j \ge 0\}$.
This same formalization was implicitly used in earlier works [17, 28]. Note
that there are also other formalizations; e.g., see [19], where entries of
$\lambda$ are permitted to decrease. The present formulation is also
characterized by its focus on the coefficient vector $\lambda$ as the
"fundamental object," as opposed to the functional $\sum_j \lambda_j h_j$
defined by taking the $\lambda_j$ as weights for the $h_j$. This is expressed by
our choice of the $\ell_1$ norm, $\|\lambda\|_1 = \sum_j |\lambda_j|$, to
"measure" $\lambda$; if one focuses on the functional instead, then it is
necessary to take into account that (because of the possible linear dependence
of the $h_j$) several different choices of $\lambda$ can give rise to the same
functional. (E.g., if for some pair $\ell, \ell'$ we have
$h_\ell = -h_{\ell'}$, then adding a constant $a$ to both $\lambda_\ell$ and
$\lambda_{\ell'}$ does not change $\sum_j \lambda_j h_j$.) One must then use a
norm that "quotients out" this ambiguity, as in (for instance)
$|||\lambda||| := \min\{\|\mathbf{a}\|_1 : \sum_j a_j h_j = \sum_j \lambda_j h_j\}$.
By restricting ourselves to positive increments only, and using the
$\ell_1$-norm of $\lambda_t$, we avoid those nonuniqueness issues. For our new
algorithms, we prove
$\lim_{t\to\infty} [\min_i (M\lambda_t)_i / \|\lambda_t\|_1] = \rho$, where
$\rho$ is the maximum possible value of this quantity (defined later). Since
$\|\lambda_t\|_1 \ge |||\lambda_t|||$, and $\rho$ is an upper bound for these
fractions, it follows automatically that for our algorithms,

$$
\lim_{t\to\infty} \left[\min_i \frac{(M\lambda_t)_i}{|||\lambda_t|||}\right] = \rho
$$

as well; i.e., we prove convergence to a maximum margin solution even for
the functional based norm. AdaBoost itself cannot be guaranteed to reach
the maximum margin solution in the limit, regardless of whether
$\|\lambda_t\|_1$ or $|||\lambda_t|||$ is used in the denominator.)
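The definitions of the per-example margin, the margin of the training set and the support vectors can be made concrete on a toy problem. The sketch below is only illustrative: the 3 x 3 matrix `M` and the helper names are our own choices, and the coefficient vectors are assumed to be nonnegative and nonzero so that the 1-norm is a plain sum.

```python
def margins(M, lam):
    """Per-example margins (M lam)_i / ||lam||_1 for nonnegative, nonzero lam."""
    s = sum(lam)  # the 1-norm, because all entries are nonnegative
    return [sum(M_ij * l_j for M_ij, l_j in zip(row, lam)) / s for row in M]

def min_margin(M, lam):
    """Margin of the training set: the smallest per-example margin."""
    return min(margins(M, lam))

def support_vectors(M, lam, tol=1e-12):
    """Indices of the examples whose margin equals the minimum margin."""
    mu = min_margin(M, lam)
    return [i for i, g in enumerate(margins(M, lam)) if abs(g - mu) <= tol]

# Hypothetical toy problem: M[i][j] = +1 if weak classifier j is correct on example i.
M = [[+1, +1, -1],
     [+1, -1, +1],
     [-1, +1, +1]]

print(margins(M, [1.0, 1.0, 1.0]), support_vectors(M, [1.0, 1.0, 1.0]))
print(margins(M, [2.0, 1.0, 1.0]), support_vectors(M, [2.0, 1.0, 1.0]))
```

With equal coefficients every example has the same margin, so all three are support vectors; putting extra weight on the first classifier drops the margin of the example it misclassifies.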

A boosting algorithm maintains a distribution, or set of weights, over
the training examples that is updated at each iteration $t$. This distribution
is denoted $d_t \in \Delta_m$, and $d_t^T$ is its transpose. Here, $\Delta_m$
denotes the simplex of $m$-dimensional vectors with nonnegative entries that
sum to 1. At each iteration $t$, a weak classifier $h_{j_t}$ is selected by the
weak learning algorithm. The probability of error at iteration $t$, denoted
$d_-$, of the selected weak classifier $h_{j_t}$ on the training examples
(weighted by the discrete distribution $d_t$) is
$d_- := \sum_{\{i : M_{ij_t} = -1\}} d_{t,i}$. Also, denote $d_+ := 1 - d_-$.
Define $\mathcal{I}_+ := \{i : M_{ij_t} = +1\}$, the set of correctly classified
examples at iteration $t$, and similarly define
$\mathcal{I}_- := \{i : M_{ij_t} = -1\}$. Note that $d_+$, $d_-$,
$\mathcal{I}_+$ and $\mathcal{I}_-$ depend on $t$; although we have simplified
the notation, the iteration number will be clear from the context.

The edge of weak classifier $j_t$ at time $t$ is
$r_t := (d_t^T M)_{j_t} = d_+ - d_- = 1 - 2d_-$, with $(\cdot)_k$ indicating the
$k$th vector component. Thus, a larger edge indicates a lower probability of
error. Note that $d_+ = (1 + r_t)/2$ and $d_- = (1 - r_t)/2$. Also define

$$
\gamma_t := \tanh^{-1} r_t.
$$

Due to the von Neumann Min-Max theorem for 2-player zero-sum games,

$$
\min_{d \in \Delta_m} \max_j\, (d^T M)_j = \max_{\bar\lambda \in \Delta_n} \min_i\, (M\bar\lambda)_i.
$$

That is, the minimum value of the maximum edge (left-hand side) corresponds
to the maximum value of the margin. We denote this value by $\rho$.

We wish our learning algorithms to have robust convergence, so we will not
generally require the weak learning algorithm to produce the weak classifier
with the largest possible edge value at each iteration. Rather, we only require
a weak classifier whose edge is at least $\rho$, that is,
$j_t \in \{j : (d_t^T M)_j \ge \rho\}$. This notion of robustness has been
previously used for the analysis of AdaBoost* and arc-gv. Here, AdaBoost in the
optimal case means that the best weak classifier is chosen at every iteration:
$j_t \in \mathrm{argmax}_j (d_t^T M)_j$, while AdaBoost in the nonoptimal case
means that any good enough weak classifier is chosen:
$j_t \in \{j : (d_t^T M)_j \ge \rho\}$. The case of bounded edges is a subset of
the nonoptimal case for some $\bar\rho \ge \rho$ and $\sigma \ge 0$, namely
$j_t \in \{j : \bar\rho \le (d_t^T M)_j \le \bar\rho + \sigma\}$.

We are interested in the separable case where $\rho > 0$ and the training error
is zero; the margin specifically allows us to distinguish between classifiers
that have zero training error. In the nonseparable case, AdaBoost's objective
function $F$ is an upper bound on the training error, and convergence is well
understood [4]. Not only does AdaBoost converge to the minimum of $F$, but
it has recently been shown that it converges to the solution of the "bipartite
ranking problem" at the same time; AdaBoost solves two problems for the
price of one in the nonseparable case [21, 24]. However, in the separable case,
where $F$ cannot distinguish between classifiers since it simply converges to
zero, the margin theory suggests that we not only minimize $F$, but also
distinguish between classifiers by choosing one that maximizes the margin.
Since one does not know in advance whether the problem is separable, in
this work we use AdaBoost until the problem becomes separable, and then
perhaps switch to a mode designed explicitly to maximize the margin.

1. Input: Matrix $M$, No. of iterations $t_{\max}$
2. Initialize: $\lambda_{1,j} = 0$ for $j = 1, \dots, n$, also $d_{1,i} = 1/m$
   for $i = 1, \dots, m$, and $s_1 = 0$.
3. Loop for $t = 1, \dots, t_{\max}$
   (a) $j_t \in \mathrm{argmax}_j (d_t^T M)_j$ (optimal case), or
       $j_t \in \{j : (d_t^T M)_j \ge \rho\}$ (nonoptimal case)
   (b) $r_t = (d_t^T M)_{j_t}$
   (c) $g_t = \max[0, G(\lambda_t)]$ where $G(\lambda_t)$ is defined in (3.1),
       $G(\lambda_t) = \bigl(-\ln \sum_{i=1}^{m} e^{-(M\lambda_t)_i}\bigr)/s_t$.
   (d) AdaBoost:
         $\alpha_t = \frac{1}{2} \ln\frac{1 + r_t}{1 - r_t}$
       Approximate coordinate ascent boosting:
         $\alpha_t = \frac{1}{2} \ln\frac{1 + r_t}{1 - r_t} - \frac{1}{2} \ln\frac{1 + g_t}{1 - g_t}$
       Coordinate ascent boosting:
         if $g_t > 0$, $\alpha_t = \mathrm{argmax}_\alpha\, G(\lambda_t + \alpha e_{j_t})$,
         else use AdaBoost.
   (e) $\lambda_{t+1} = \lambda_t + \alpha_t e_{j_t}$, where $e_{j_t}$ is 1 in
       position $j_t$ and 0 elsewhere.
   (f) $s_{t+1} = s_t + \alpha_t$
   (g) $d_{t+1,i} = d_{t,i}\, e^{-M_{ij_t}\alpha_t}/z_t$ where
       $z_t = \sum_{i=1}^{m} d_{t,i}\, e^{-M_{ij_t}\alpha_t}$
4. Output: $\lambda_{t_{\max}}/s_{t_{\max}}$

Fig. 2. Pseudocode for the AdaBoost algorithm, coordinate ascent boosting and
approximate coordinate ascent boosting.

Figure 2 shows the pseudocode for AdaBoost, coordinate ascent boosting,
and approximate coordinate ascent boosting. On each round of boosting,
classifier $j_t$ with sufficiently large edge is selected (Step 3a), the weight
of that classifier is updated (Step 3e), and the distribution $d_t$ is updated
and renormalized (Step 3g). Note that
$\lambda_{t+1,j} = \sum_{\tilde t = 1}^{t} \alpha_{\tilde t}\, 1_{j_{\tilde t} = j}$,
where $1_{j_{\tilde t} = j}$ is 1 if $j_{\tilde t} = j$ and 0 otherwise. The
notation $e_{j_t}$ means the vector that is 1 in position $j_t$ and 0 elsewhere.
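The AdaBoost branch of Figure 2 (optimal case) can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the function name `adaboost`, the toy matrix `M` and the tie-breaking rule (first maximal index) are our assumptions, and the sketch assumes `t_max >= 1` and that every selected edge lies strictly between 0 and 1.

```python
import math

def adaboost(M, t_max):
    """AdaBoost, optimal case, on a +/-1 matrix with M[i][j] = y_i * h_j(x_i)."""
    m, n = len(M), len(M[0])
    lam = [0.0] * n        # unnormalized coefficient vector lambda_t
    d = [1.0 / m] * m      # distribution d_t over the training examples
    s = 0.0                # s_t = ||lambda_t||_1
    for _ in range(t_max):
        # Steps 3a and 3b: compute all edges and take the largest one.
        edges = [sum(d[i] * M[i][j] for i in range(m)) for j in range(n)]
        j_t = max(range(n), key=lambda j: edges[j])
        r_t = edges[j_t]
        # Step 3d (AdaBoost): alpha_t = (1/2) ln((1 + r_t) / (1 - r_t)) = atanh(r_t).
        alpha = math.atanh(r_t)
        # Steps 3e and 3f: update the chosen coordinate and the running 1-norm.
        lam[j_t] += alpha
        s += alpha
        # Step 3g: reweight the examples and renormalize.
        w = [d[i] * math.exp(-M[i][j_t] * alpha) for i in range(m)]
        z = sum(w)
        d = [w_i / z for w_i in w]
    return [l / s for l in lam], d

# Hypothetical toy problem in which each weak classifier misclassifies one example.
M = [[+1, +1, -1],
     [+1, -1, +1],
     [-1, +1, +1]]
lam, d = adaboost(M, 50)
print("normalized coefficients:", lam)
print("minimum margin:", min(sum(M[i][j] * lam[j] for j in range(3)) for i in range(3)))
```

For this matrix the uniform combination attains margin 1/3 on every example, so the printed minimum margin cannot exceed 1/3.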

2.1. AdaBoost is coordinate descent. AdaBoost is a coordinate descent
algorithm for minimizing

$$
F(\lambda) := \sum_{i=1}^{m} e^{-(M\lambda)_i}.
$$

This has been shown many times [2, 6, 8, 12, 16], so we will only sketch the
proof to introduce our notation. The direction AdaBoost chooses at iteration
$t$ (corresponding to the choice of weak classifier $j_t$) in the optimal case
is

$$
j_t \in \mathrm{argmax}_j \left[ -\frac{dF(\lambda_t + \alpha e_j)}{d\alpha} \bigg|_{\alpha = 0} \right]
    = \mathrm{argmax}_j \sum_{i=1}^{m} e^{-(M\lambda_t)_i} M_{ij}
    = \mathrm{argmax}_j\, (d_t^T M)_j.
$$

The step size AdaBoost chooses at iteration $t$ is $\alpha_t$, where $\alpha_t$
satisfies the following equation, that is, the equation for the line search
along direction $j_t$:

$$
0 = -\frac{dF(\lambda_t + \alpha_t e_{j_t})}{d\alpha_t}
  = \sum_{i=1}^{m} e^{-(M(\lambda_t + \alpha_t e_{j_t}))_i} M_{ij_t},
$$

$$
0 = d_+ e^{-\alpha_t} - d_- e^{\alpha_t},
$$

$$
\alpha_t = \frac{1}{2} \ln\left(\frac{d_+}{d_-}\right)
         = \frac{1}{2} \ln\left(\frac{1 + r_t}{1 - r_t}\right)
         = \tanh^{-1} r_t = \gamma_t.
$$

Note that for both the optimal and nonoptimal cases,
$\alpha_t \ge \tanh^{-1} \rho > 0$, by monotonicity of $\tanh^{-1}$.
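The closed-form step size can be checked numerically against the one-dimensional objective. The sketch below assumes the weights and column given in the usage lines (both hypothetical) and uses the fact that, along coordinate $j_t$, the ratio $F(\lambda_t + \alpha e_{j_t})/F(\lambda_t)$ equals $\sum_i d_{t,i} e^{-M_{ij_t}\alpha}$; it also assumes that the chosen classifier misclassifies some weight, so that $d_- > 0$.

```python
import math

def adaboost_step(d, column):
    """Return the step size computed two ways: (1/2) ln(d_+/d_-) and atanh(r_t)."""
    d_minus = sum(d_i for d_i, M_ij in zip(d, column) if M_ij == -1)
    d_plus = 1.0 - d_minus
    r = d_plus - d_minus
    return 0.5 * math.log(d_plus / d_minus), math.atanh(r)

def relative_F(alpha, d, column):
    """F(lambda_t + alpha e_j) / F(lambda_t) = sum_i d_i exp(-M_ij alpha)."""
    return sum(d_i * math.exp(-M_ij * alpha) for d_i, M_ij in zip(d, column))

# Hypothetical weights and classifier column: the middle example is misclassified.
d = [0.25, 0.25, 0.5]
column = [+1, -1, +1]
a_log, a_atanh = adaboost_step(d, column)
print(a_log, a_atanh)
for alpha in (a_log - 0.1, a_log, a_log + 0.1):
    print(alpha, relative_F(alpha, d, column))
```

Both formulas give the same value, and the relative objective is smallest at that value among the three points printed.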

In the nonseparable case, the $d_t$'s converge to a fixed vector [4]. In the
separable case, the $d_t$'s cannot converge to a fixed vector, and the minimum
value of $F$ is 0, occurring as $\|\lambda\|_1 \to \infty$. It is important to
appreciate that this tells us nothing about the value of the margin achieved by
AdaBoost or any other procedure designed to minimize $F$. In fact, an arbitrary
algorithm that minimizes $F$ can achieve an arbitrarily bad (small) margin. [To
see why, consider any $\bar\lambda \in \Delta_n$ such that
$(M\bar\lambda)_i > 0$ for all $i$, assuming we are in the separable case so
such a $\bar\lambda$ exists. Then $\lim_{a\to\infty} a\bar\lambda$ will produce a
minimum value for $F$, but the original normalized $\bar\lambda$ need not yield
a maximum margin.] So it must be the process of coordinate descent that awards
AdaBoost its ability to increase margins, not simply AdaBoost's ability to
minimize $F$. The value of the function $F$ tells us very little about the value
of the margin; even asymptotically, it only tells us whether the margin is
positive or not.
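The bracketed argument is easy to reproduce numerically: scaling a separating vector drives $F$ toward zero while leaving its margin unchanged. In the sketch below, the matrix `M` and the vector `lam` are hypothetical; for this `M` the uniform vector attains margin 1/3, so `lam` (margin 0.2) is deliberately not a maximum margin solution.

```python
import math

def F(M, lam):
    """Exponential loss F(lam) = sum_i exp(-(M lam)_i)."""
    return sum(math.exp(-sum(a * b for a, b in zip(row, lam))) for row in M)

def min_margin(M, lam):
    """Minimum margin min_i (M lam)_i / ||lam||_1 for nonnegative lam."""
    return min(sum(a * b for a, b in zip(row, lam)) for row in M) / sum(lam)

M = [[+1, +1, -1],
     [+1, -1, +1],
     [-1, +1, +1]]
lam = [0.4, 0.35, 0.25]  # separates the data, but with margin 0.2 rather than 1/3

for a in (1, 10, 100, 1000):
    scaled = [a * l for l in lam]
    print(a, F(M, scaled), min_margin(M, scaled))
```

The first printed column after the scale shrinks rapidly while the second stays at 0.2, which is the sense in which the value of $F$ alone says little about the margin.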

A helpful property of AdaBoost is that we can do the line search at each 
step explicitly; that is, we have an analytical expression for the value of $\alpha_t$
for each t. Our second boosting algorithm, approximate coordinate ascent 
boosting, which incorporates an approximate line search, also has an update 
that can be solved explicitly. 



3. The smooth margin function $G(\lambda)$. We wish to consider a function
that, unlike $F$, actually tells us about the value of the margin. Our new
function $G$ has the nice property that its maximum value corresponds to
the maximum value of the margin. Here, $G$ is defined for
$\lambda \in \mathbb{R}^n_+$, $\|\lambda\|_1 > 0$ by

$$
G(\lambda) := \frac{-\ln F(\lambda)}{\|\lambda\|_1}
            = \frac{-\ln\left(\sum_{i=1}^{m} e^{-(M\lambda)_i}\right)}{\sum_j \lambda_j}. \tag{3.1}
$$

One can think of $G$ as a smooth approximation of the margin, since it
depends on the entire margin distribution when $\|\lambda\|_1$ is small, and
weights training examples with small margins much more highly than examples
with larger margins, especially as $\|\lambda\|_1$ grows. The function $G$ also
bears a resemblance to the objective implicitly used for $\varepsilon$-boosting
[19]. $G$ has many nice properties that are useful for understanding its
geometry:



Proposition 3.1 (Properties of the smooth margin [25]).

1. $G(\lambda)$ is a concave function (but not necessarily strictly concave) in
each "shell" where $\|\lambda\|_1$ is fixed.

2. The value of $G(\lambda)$ increases radially, that is,
$G(a\lambda) > G(\lambda)$ for $a > 1$.

3. As $\|\lambda\|_1$ becomes large, $G(\lambda)$ tends to $\mu(\lambda)$.
Specifically,

$$
-\frac{\ln m}{\|\lambda\|_1} + \mu(\lambda) \le G(\lambda) < \mu(\lambda).
$$

Proof. It follows from properties 2 and 3 that the maximum value of
$G$ is the maximum value of the margin.

The proofs of properties 1 and 2 are in Section 8. Oddly enough, a lack
of concavity does not affect our analysis, as our algorithms will iteratively
maximize $G$, whether or not it is concave. For the proof of property 3,

$$
m e^{-\mu(\lambda)\|\lambda\|_1}
  = \sum_{i=1}^{m} e^{-\min_\ell (M\lambda)_\ell}
  \ge \sum_{i=1}^{m} e^{-(M\lambda)_i}
  > e^{-\min_i (M\lambda)_i}
  = e^{-\mu(\lambda)\|\lambda\|_1},
$$

and taking logarithms, dividing by $\|\lambda\|_1$ and negating yields the
result. □
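Properties 2 and 3 can be observed numerically by scaling a fixed direction. The following sketch reuses a hypothetical 3 x 3 matrix and coefficient vector; the helper names are ours.

```python
import math

def scores(M, lam):
    """Unnormalized margins (M lam)_i."""
    return [sum(a * b for a, b in zip(row, lam)) for row in M]

def smooth_margin(M, lam):
    """G(lam) = -ln(sum_i exp(-(M lam)_i)) / ||lam||_1, as in (3.1)."""
    return -math.log(sum(math.exp(-x) for x in scores(M, lam))) / sum(lam)

def min_margin(M, lam):
    """mu(lam) = min_i (M lam)_i / ||lam||_1."""
    return min(scores(M, lam)) / sum(lam)

M = [[+1, +1, -1],
     [+1, -1, +1],
     [-1, +1, +1]]
lam = [0.4, 0.35, 0.25]

for a in (1.0, 5.0, 50.0):
    scaled = [a * l for l in lam]
    lower = min_margin(M, scaled) - math.log(len(M)) / sum(scaled)
    print(a, lower, smooth_margin(M, scaled), min_margin(M, scaled))
```

As the scale grows, the printed value of $G$ increases and is squeezed between the lower bound of property 3 and the minimum margin.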



Since all values of the edge (even in the nonoptimal case) are required to
be at least the maximum margin $\rho$, we have for each iteration $t$, where
recall $s_t := \|\lambda_t\|_1$,

$$
-\frac{\ln m}{s_t} + \mu(\lambda_t) \le G(\lambda_t) < \mu(\lambda_t) \le \rho \le r_t. \tag{3.2}
$$

4. Derivation of algorithms. We now suggest two boosting algorithms
that aim to maximize the margin explicitly (like arc-gv and AdaBoost*), 
are based on coordinate ascent and adaptively adjust their step sizes (like 
AdaBoost). Before we derive the algorithms, we will write recursive equa- 
tions for F and G. This will provide a method for computing the values of F 
and G at iteration t + 1 in terms of their values at iteration t. The recursive 
equation for F is 

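$$
\begin{aligned}
F(\lambda_t + \alpha e_{j_t})
  &= \sum_{i=1}^{m} e^{-(M(\lambda_t + \alpha e_{j_t}))_i}
   = \sum_{i \in \mathcal{I}_+} e^{-(M\lambda_t)_i} e^{-\alpha}
   + \sum_{i \in \mathcal{I}_-} e^{-(M\lambda_t)_i} e^{\alpha} \\
  &= [d_+ e^{-\alpha} + d_- e^{\alpha}]\, F(\lambda_t)
   = [\cosh\alpha - r_t \sinh\alpha]\, F(\lambda_t).
\end{aligned}
$$

Here we remind the reader that $\cosh x = (e^x + e^{-x})/2$,
$\sinh x = (e^x - e^{-x})/2$, and so $\cosh(\tanh^{-1} x) = (1 - x^2)^{-1/2}$.
Recall the definition $\gamma_t := \tanh^{-1} r_t$. Continuing to reduce, we
find the recursive equation for $F$,

$$
F(\lambda_t + \alpha e_{j_t})
  = \frac{\cosh\gamma_t \cosh\alpha - \sinh\gamma_t \sinh\alpha}{\cosh\gamma_t}\, F(\lambda_t)
  = \frac{\cosh(\gamma_t - \alpha)}{\cosh\gamma_t}\, F(\lambda_t). \tag{4.1}
$$

Here we have used the identity
$\cosh(x - y) = \cosh x \cosh y - \sinh x \sinh y$. Now we find a recursive
equation for $G$. By definition of $G$, we know
$-\ln F(\lambda_t) = s_t G(\lambda_t)$. Taking the logarithm of (4.1) and
negating,

$$
\begin{aligned}
(s_t + \alpha)\, G(\lambda_t + \alpha e_{j_t})
  &= -\ln F(\lambda_t + \alpha e_{j_t})
   = -\ln F(\lambda_t) - \ln\left(\frac{\cosh(\gamma_t - \alpha)}{\cosh\gamma_t}\right) \\
  &= s_t G(\lambda_t) + \ln\left(\frac{\cosh\gamma_t}{\cosh(\gamma_t - \alpha)}\right)
   = s_t G(\lambda_t) + \int_{\gamma_t - \alpha}^{\gamma_t} \tanh u \, du.
\end{aligned} \tag{4.2}
$$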
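Both recursions can be verified numerically for an arbitrary coordinate and step. The sketch below is illustrative only: the matrix, the coefficient vector, the coordinate `j` and the step `alpha` are hypothetical, and the weights are computed as $d_{t,i} = e^{-(M\lambda_t)_i}/F(\lambda_t)$, which is the distribution AdaBoost maintains when started from $\lambda_1 = 0$.

```python
import math

def F(M, lam):
    """Exponential loss F(lam) = sum_i exp(-(M lam)_i)."""
    return sum(math.exp(-sum(a * b for a, b in zip(row, lam))) for row in M)

def G(M, lam):
    """Smooth margin G(lam) = -ln F(lam) / ||lam||_1."""
    return -math.log(F(M, lam)) / sum(lam)

def check_recursion(M, lam, j, alpha):
    """Return (lhs, rhs) pairs for the recursive equations (4.1) and (4.2)."""
    F_t = F(M, lam)
    # AdaBoost weights d_{t,i} = exp(-(M lam)_i) / F(lam).
    d = [math.exp(-sum(a * b for a, b in zip(row, lam))) / F_t for row in M]
    r = sum(d_i * row[j] for d_i, row in zip(d, M))  # edge of classifier j
    gamma = math.atanh(r)
    stepped = list(lam)
    stepped[j] += alpha
    s = sum(lam)
    eq41 = (F(M, stepped), math.cosh(gamma - alpha) / math.cosh(gamma) * F_t)
    eq42 = ((s + alpha) * G(M, stepped),
            s * G(M, lam) + math.log(math.cosh(gamma) / math.cosh(gamma - alpha)))
    return eq41, eq42

M = [[+1, +1, -1],
     [+1, -1, +1],
     [-1, +1, +1]]
eq41, eq42 = check_recursion(M, [0.4, 0.35, 0.25], j=2, alpha=0.3)
print(eq41)
print(eq42)
```

Each printed pair should agree up to floating-point rounding.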



Thus, we have a recursive equation for $G$. We will derive two algorithms; in
the first, we assign to $\alpha_t$ the value $\alpha$ that maximizes
$G(\lambda_t + \alpha e_{j_t})$, which requires solving an implicit equation.
In the second algorithm, we pick an approximate value for the maximizer that
can be computed in a straightforward way. In both cases, since it is not known
in advance whether the problem is separable, the algorithm starts by running
AdaBoost until $G(\lambda)$ becomes positive, which eventually must happen (in
the separable case) by the following:

Proposition 4.1. In the separable case (where $\rho > 0$), AdaBoost achieves
a positive value for $G(\lambda_t)$ for some iteration $t$.

Proof. For the iteration defined by AdaBoost (i.e., $\alpha_t = \gamma_t = \tanh^{-1} r_t$),
we have from (4.1)

\[
F(\lambda_{t+1}) = F(\lambda_t + \gamma_t e_{j_t}) = \frac{\cosh 0}{\cosh\gamma_t}\, F(\lambda_t)
= (1 - r_t^2)^{1/2} F(\lambda_t) \le (1 - \rho^2)^{1/2} F(\lambda_t).
\]

Hence, by this recursion, $F(\lambda_{t+1}) \le (1 - \rho^2)^{t/2} F(\lambda_1)$. It follows that
as soon as $t$ exceeds

\[
\frac{2 \ln F(\lambda_1)}{-\ln(1 - \rho^2)}
\]

we have $F(\lambda_{t+1}) < 1$, so that $G(\lambda_{t+1}) = (-\ln F(\lambda_{t+1}))/s_{t+1} > 0$. □
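As an illustration of Proposition 4.1, the sketch below runs optimal-case
AdaBoost on a small separable example and records the first iteration at which
$G$ becomes positive. The matrix `M` (for which $\rho = 1/3$), the function name and
the iteration cap are assumptions made for this example; the starting point
$\lambda_1 = 0$ gives $F(\lambda_1) = m$.

```python
import math

# Hypothetical 3 x 3 example in which each weak classifier misclassifies
# exactly one training example; its maximum margin is rho = 1/3.
M = [[1, 1, -1],
     [1, -1, 1],
     [-1, 1, 1]]
rho = 1.0 / 3.0

def adaboost_until_positive(M, max_iter=1000):
    """Run optimal-case AdaBoost from lambda_1 = 0.

    Returns the first iteration index t with G(lambda_t) > 0, and lambda_t.
    """
    m, n = len(M), len(M[0])
    lam = [0.0] * n
    for t in range(1, max_iter + 1):
        w = [math.exp(-sum(a * b for a, b in zip(row, lam))) for row in M]
        F = sum(w)
        if sum(lam) > 0 and F < 1.0:      # G = -ln(F)/s_t > 0 iff F < 1
            return t, lam
        d = [wi / F for wi in w]          # AdaBoost weights d_t
        edges = [sum(d[i] * M[i][j] for i in range(m)) for j in range(n)]
        j = max(range(n), key=lambda k: edges[k])   # largest edge
        lam[j] += math.atanh(edges[j])    # alpha_t = gamma_t
    raise RuntimeError("G did not become positive within max_iter iterations")

t_pos, lam_pos = adaboost_until_positive(M)
# Threshold from the proof, with F(lambda_1) = m.
bound = 2.0 * math.log(len(M)) / (-math.log(1.0 - rho ** 2))
```

In this example the edges are typically well above $\rho$, so `t_pos` should be
considerably smaller than `bound + 1`.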

For convenience in distinguishing the two algorithms defined below, we
denote $\lambda_1^{[1]}, \ldots, \lambda_t^{[1]}$ to be a sequence of coefficient vectors generated by
Algorithm 1, and $\lambda_1^{[2]}, \ldots, \lambda_t^{[2]}$ to be a sequence generated by Algorithm
2. Similarly, we distinguish the sequences $\alpha_t^{[1]}$ from $\alpha_t^{[2]}$, $g_t^{[1]} := G(\lambda_t^{[1]})$,
$g_t^{[2]} := G(\lambda_t^{[2]})$, $s_t^{[1]} := \sum_j \lambda_{t,j}^{[1]}$ and $s_t^{[2]} := \sum_j \lambda_{t,j}^{[2]}$. Sometimes we compare the
behavior of Algorithms 1 and 2 based on one iteration (from $t$ to $t + 1$) as if
they had started from the same coefficient vector at iteration $t$; we denote
this vector by $\lambda_t$. When an equation holds for both Algorithm 1 and
Algorithm 2, we will often drop the superscripts. Although sequences such as $j_t$,
$r_t$, $\gamma_t$, and $d_t$ are also different for Algorithms 1 and 2, we leave the notation
without the superscript.

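Note that it is important to compute $g_t$ in a numerically stable way. The
pseudocode in Figure 2 might thus be replaced with

\[
G(\lambda_t) = \mu(\lambda_t) - \frac{\ln \sum_{i=1}^m e^{-[(M\lambda_t)_i - \min_{i'} (M\lambda_t)_{i'}]}}{s_t},
\qquad \text{where } \mu(\lambda_t) = \frac{\min_i (M\lambda_t)_i}{s_t}.
\]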
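A minimal Python sketch of this stable evaluation is given below, next to the
naive formula for comparison. The function names and the sample inputs are
assumptions made for illustration; the input `margin_vec` stands for the vector
$(M\lambda_t)_i$ and `s` for $s_t$.

```python
import math

def smooth_margin_stable(margin_vec, s):
    """G = mu - ln(sum_i exp(-[(M lam)_i - min_i' (M lam)_i'])) / s."""
    lo = min(margin_vec)
    # Every exponent is <= 0 and one of them equals 0, so the sum lies in [1, m].
    shifted = sum(math.exp(-(v - lo)) for v in margin_vec)
    return lo / s - math.log(shifted) / s

def smooth_margin_naive(margin_vec, s):
    """Direct evaluation of -ln(F)/s; exp can underflow for large margins."""
    return -math.log(sum(math.exp(-v) for v in margin_vec)) / s

g_small = smooth_margin_stable([0.5, 0.9, -0.1], 1.3)
g_large = smooth_margin_stable([800.0, 801.0, 805.0], 2000.0)
```

For moderate inputs the two functions agree. For the large margins in the
second call, `math.exp(-800.0)` underflows to zero, so the naive version fails
inside `math.log`, whereas the shifted version stays finite.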



4.1. Algorithm 1: coordinate ascent boosting. Let us consider coordinate 
ascent on G. In what follows, we will use only positive values of G, as we 
have justified via Proposition 4.1. The choice of direction $j_t$ at iteration $t$
(in the optimal case) obeys

\[
j_t \in \operatorname*{argmax}_j \left.\frac{dG(\lambda_t^{[1]} + \alpha e_j)}{d\alpha}\right|_{\alpha=0}
= \operatorname*{argmax}_j \left[\frac{\sum_{i=1}^m e^{-(M\lambda_t^{[1]})_i} M_{ij}}{F(\lambda_t^{[1]})\, s_t^{[1]}}
+ \frac{\ln F(\lambda_t^{[1]})}{(s_t^{[1]})^2}\right].
\]

Of the two terms on the right, the second term does not depend on $j$, and
the first term is simply a constant times $(d_t^T M)_j$. Thus the same direction
will be chosen here as for AdaBoost. The "nonoptimal" setting we define
for this algorithm will be the same as AdaBoost's, so the weak learning
algorithm (Step 3a) of Algorithm 1 will be the same as AdaBoost's.

To determine the step size, ideally we would like to maximize
$G(\lambda_t^{[1]} + \alpha e_{j_t})$ with respect to $\alpha$, that is, we would like to define the step
size to obey $dG(\lambda_t^{[1]} + \alpha e_{j_t})/d\alpha = 0$ for $\alpha = \alpha_t^{[1]}$. Differentiating (4.2) gives

\[
(s_t^{[1]} + \alpha)\, \frac{dG(\lambda_t^{[1]} + \alpha e_{j_t})}{d\alpha} + G(\lambda_t^{[1]} + \alpha e_{j_t}) = \tanh(\gamma_t - \alpha).
\]

Thus, our ideal step size $\alpha_t^{[1]}$ satisfies

\[
G(\lambda_{t+1}^{[1]}) = G(\lambda_t^{[1]} + \alpha_t^{[1]} e_{j_t}) = \tanh(\gamma_t - \alpha_t^{[1]}). \tag{4.3}
\]

There is not a nice analytical solution for $\alpha_t^{[1]}$ (as there is for AdaBoost),
but maximization of $G(\lambda_t^{[1]} + \alpha e_{j_t})$ is one-dimensional so it can be performed
reasonably quickly. Hence we have defined the first of our new boosting
algorithms, coordinate ascent on $G$, implementing a line search at each iteration.
Furthermore:

Proposition 4.2. Equation (4.3) has a unique solution $\alpha_t^{[1]}$, and $\alpha_t^{[1]} > 0$.

Proof. First, we rewrite the line search equation (4.3) using (4.2),

\[
s_t^{[1]} G(\lambda_t^{[1]}) + \ln\left(\frac{\cosh\gamma_t}{\cosh(\gamma_t - \alpha_t^{[1]})}\right)
= (s_t^{[1]} + \alpha_t^{[1]}) \tanh(\gamma_t - \alpha_t^{[1]}).
\]

Consider the function $f_t$,

\[
f_t(\alpha) := s_t^{[1]} G(\lambda_t^{[1]}) + \ln\left(\frac{\cosh\gamma_t}{\cosh(\gamma_t - \alpha)}\right)
- (s_t^{[1]} + \alpha) \tanh(\gamma_t - \alpha).
\]

Now, $df_t(\alpha)/d\alpha = (\alpha + s_t^{[1]})\,\mathrm{sech}^2(\gamma_t - \alpha) > 0$ for $\alpha > 0$. Thus $f_t$ is strictly
increasing, so there is at most one root. We also have
$f_t(0) = s_t^{[1]}(G(\lambda_t^{[1]}) - r_t) < 0$ and $f_t(\gamma_t) = s_t^{[1]} G(\lambda_t^{[1]}) - \frac{1}{2}\ln(1 - r_t^2) > 0$.
Thus, by the intermediate value theorem, there is at least one root. Hence,
there is exactly one solution for $\alpha_t^{[1]}$, where $\alpha_t^{[1]} > 0$. □
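Since $f_t$ is strictly increasing with $f_t(0) < 0 < f_t(\gamma_t)$, the root can be
bracketed in $(0, \gamma_t)$ and found by bisection. The sketch below is one possible
way to carry out the line search, not necessarily the one used in the
experiments; the function name, the tolerance and the sample values of $s_t$,
$g_t$ and $r_t$ are assumptions made for illustration.

```python
import math

def line_search_step(s, g, r, tol=1e-12):
    """Solve f_t(alpha) = 0 on (0, gamma_t) by bisection.

    s: current 1-norm s_t > 0, g: current smooth margin with 0 < g < r,
    r: edge of the chosen weak classifier, 0 < r < 1.
    """
    gamma = math.atanh(r)

    def f(alpha):
        log_term = math.log(math.cosh(gamma) / math.cosh(gamma - alpha))
        return s * g + log_term - (s + alpha) * math.tanh(gamma - alpha)

    lo, hi = 0.0, gamma          # f(lo) < 0 < f(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical values for one iteration.
s, g, r = 2.0, 0.1, 0.5
gamma = math.atanh(r)
alpha1 = line_search_step(s, g, r)
# New value of G from the recursion (4.2).
g_next = (s * g + math.log(math.cosh(gamma) / math.cosh(gamma - alpha1))) / (s + alpha1)
```

At the computed step, `g_next` should match `math.tanh(gamma - alpha1)` up to
the bisection tolerance, which is exactly the condition (4.3).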



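Let us rearrange our equations slightly in order to study the update. Using
the notation $g_{t+1}^{[1]} := G(\lambda_{t+1}^{[1]})$ in (4.3), we find that $\alpha_t^{[1]}$ satisfies the following
(implicitly):

\[
\alpha_t^{[1]} = \gamma_t - \tanh^{-1}(g_{t+1}^{[1]}) = \tanh^{-1} r_t - \tanh^{-1}(g_{t+1}^{[1]})
= \frac{1}{2} \ln\left(\frac{1 + r_t}{1 - r_t} \cdot \frac{1 - g_{t+1}^{[1]}}{1 + g_{t+1}^{[1]}}\right). \tag{4.4}
\]

Since $G(\lambda_{t+1}^{[1]}) \ge G(\lambda_t^{[1]})$, we again have $G(\lambda_{t+1}^{[1]}) > 0$, and thus
$\alpha_t^{[1]} < \tanh^{-1} r_t = \gamma_t$. Hence, the step size for this new algorithm is always
positive, and it is upper-bounded by AdaBoost's step size.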

4.2. Algorithm 2: approximate coordinate ascent boosting. The second 
of our two new boosting algorithms avoids the line search of Algorithm 1, 
and is even slightly more aggressive. It seems to perform very similarly to 
Algorithm 1 in our experiments. To define this algorithm, we consider the 
following approximate solution to the maximization problem, by using an 
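approximate solution to (4.3) at each iteration in which $\lambda_{t+1}$ is replaced by
$\lambda_t$ for tractability:

\[
G(\lambda_t^{[2]}) = \tanh(\gamma_t - \alpha_t^{[2]}), \tag{4.5}
\]

or more explicitly,

\[
\alpha_t^{[2]} = \gamma_t - \tanh^{-1}(g_t^{[2]}) = \tanh^{-1} r_t - \tanh^{-1}(g_t^{[2]})
= \frac{1}{2} \ln\left(\frac{1 + r_t}{1 - r_t} \cdot \frac{1 - g_t^{[2]}}{1 + g_t^{[2]}}\right). \tag{4.6}
\]

The update $\alpha_t^{[2]}$ is also strictly positive, since $g_t^{[2]} < \rho \le r_t$, by (3.2). Note that
this choice for $\alpha_t^{[2]}$ given by (4.5) implies, by (4.2), using the monotonicity
of tanh to take the lower endpoint on the integral,

\[
(s_t^{[2]} + \alpha_t^{[2]})\, G(\lambda_{t+1}^{[2]}) > s_t^{[2]} G(\lambda_t^{[2]}) + \alpha_t^{[2]} \tanh(\gamma_t - \alpha_t^{[2]})
= (s_t^{[2]} + \alpha_t^{[2]})\, G(\lambda_t^{[2]}),
\]

so that $G(\lambda_{t+1}^{[2]}) > G(\lambda_t^{[2]})$. That is, Algorithm 2 still increases $G$ at every
iteration. In particular, $G(\lambda_{t+1}^{[2]})$ is again strictly positive.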

Algorithm 2 is slightly more aggressive than Algorithm 1, in the sense
that it picks a larger relative step size $\alpha_t$, albeit not as large as the step size
defined by AdaBoost itself. We can see this by comparing equations (4.4)
and (4.6). If Algorithms 1 and 2 were started at the same position $\lambda_t$, with
$g_t := G(\lambda_t)$, then Algorithm 2 would always take a slightly larger step than
Algorithm 1; since $g_{t+1}^{[1]} > g_t$, we have $\alpha_t^{[1]} < \alpha_t^{[2]}$.
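The closed-form step (4.6) makes Algorithm 2 very short to implement. The
following Python sketch is an illustration under stated assumptions: the 3 x 3
matrix `M` (with $\rho = 1/3$), the optimal-case weak learner, the starting vector
(chosen so that $G$ is already positive) and the function name are all
hypothetical choices for the example.

```python
import math

# Hypothetical example; maximum margin rho = 1/3.
M = [[1, 1, -1],
     [1, -1, 1],
     [-1, 1, 1]]

def run_algorithm2(lam, n_steps):
    """Approximate coordinate ascent with the step size (4.6).

    Uses an optimal-case weak learner and requires G(lam) > 0 at the start.
    Returns a list of (g_t, r_t, alpha_t) triples.
    """
    m, n = len(M), len(M[0])
    history = []
    for _ in range(n_steps):
        w = [math.exp(-sum(a * b for a, b in zip(row, lam))) for row in M]
        F, s = sum(w), sum(lam)
        g = -math.log(F) / s                    # g_t = G(lambda_t)
        d = [wi / F for wi in w]                # weights d_t
        edges = [sum(d[i] * M[i][j] for i in range(m)) for j in range(n)]
        j = max(range(n), key=lambda k: edges[k])
        r = edges[j]
        alpha = math.atanh(r) - math.atanh(g)   # equation (4.6)
        history.append((g, r, alpha))
        lam[j] += alpha
    return history

history = run_algorithm2([2.0, 2.0, 2.0], 50)
g_values = [g for g, _, _ in history]
```

Under these assumptions the recorded values of $G$ should increase at every
iteration while staying below $\rho$, and every step should be positive and
smaller than AdaBoost's step $\gamma_t$.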

5. Convergence of smooth margin algorithms. We will show convergence 
of Algorithms 1 and 2 to a maximum margin solution. Although there are 
many papers describing the convergence of specific classes of coordinate 
descent/ascent algorithms, this problem did not fit into any of the existing 
categories. For example, we were unable to fit our algorithms into any of the 
categories described by Zhang and Yu [29], but we did use some of their key
ideas as inspiration for our proofs for this section, which can all be found in 
Section 9. 

One of the main results of this analysis is that both algorithms make
significant progress at each iteration. In the next lemma, we are only
considering one increment, so we fix $\lambda_t$ at iteration $t$ and let $g_t := G(\lambda_t)$, $s_t := \sum_j \lambda_{t,j}$.
Then, denote the next values of $G$ for Algorithms 1 and 2, respectively, as
$g_{t+1}^{[1]} := G(\lambda_t + \alpha_t^{[1]} e_{j_t})$ and $g_{t+1}^{[2]} := G(\lambda_t + \alpha_t^{[2]} e_{j_t})$. Similarly, $s_{t+1}^{[1]} := s_t + \alpha_t^{[1]}$
and $s_{t+1}^{[2]} := s_t + \alpha_t^{[2]}$.

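Lemma 5.1 (Progress at every iteration).

\[
g_{t+1}^{[1]} - g_t \ge \frac{\alpha_t^{[1]} (r_t - g_t)}{2 s_{t+1}^{[1]}}
\qquad \text{and} \qquad
g_{t+1}^{[2]} - g_t \ge \frac{\alpha_t^{[2]} (r_t - g_t)}{2 s_{t+1}^{[2]}}.
\]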

Another important ingredient for our convergence proofs is that the step 
size does not increase too quickly; this is the main content of the next lemma. 

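Lemma 5.2 (Step size does not increase too quickly).

\[
\lim_{t \to \infty} \frac{\alpha_t^{[1]}}{s_{t+1}^{[1]}} = 0
\qquad \text{and} \qquad
\lim_{t \to \infty} \frac{\alpha_t^{[2]}}{s_{t+1}^{[2]}} = 0.
\]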

Lemmas 5.1 and 5.2 allow us to show convergence of Algorithms 1 and 2 
to a maximum margin solution. Recall that for convergence, it is sufficient 
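to show that $\lim_{t \to \infty} g_t = \rho$ since $g_t < \mu(\lambda_t) \le \rho$.

Theorem 5.1 (Asymptotic convergence). Algorithms 1 and 2 converge
to a maximum margin solution, that is, $\lim_{t \to \infty} g_t^{[1]} = \rho$ and $\lim_{t \to \infty} g_t^{[2]} = \rho$.
And thus, $\lim_{t \to \infty} \mu(\lambda_t^{[1]}) = \rho$ and $\lim_{t \to \infty} \mu(\lambda_t^{[2]}) = \rho$.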

Theorem 5.1 guarantees asymptotic convergence, without providing any
information about a rate of convergence. In what follows, we shall state
two different results about the convergence rate. The first theorem gives
an explicit a priori upper bound on the number of iterations needed to
guarantee that $g_t^{[1]}$ or $g_t^{[2]}$ is within $\varepsilon > 0$ of the maximum margin $\rho$. As
is often the case for uniformly valid upper bounds, the convergence rate
provided by this theorem is not optimal, in the sense that faster decay of
$\rho - g_t$ can be proved for large $t$ if one does not insist on explicit constants.
The second convergence rate theorem provides such a result, stating that
$\rho - g_t = O(t^{-1/(3+\delta)})$, or equivalently $\rho - g_t \le \varepsilon$ after $O(\varepsilon^{-(3+\delta)})$ iterations,
where $\delta > 0$ can be arbitrarily small.

Both convergence rate theorems rely on estimates limiting the growth 
rate of $\alpha_t$. Lemma 5.2 is one such estimate; because it is only an asymptotic
estimate, our first convergence rate theorem requires the following uniformly 
valid lemma. 

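Lemma 5.3 (Step size bound).

\[
\alpha_t^{[1]} \le c_1 + c_2 s_t^{[1]} \qquad \text{and} \qquad \alpha_t^{[2]} \le c_1 + c_2 s_t^{[2]},
\]

where

\[
c_1 = \frac{\ln 2}{1 - \rho} \qquad \text{and} \qquad c_2 = \frac{\rho}{1 - \rho}.
\]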

We are now ready for a first convergence rate theorem. We leave off su- 
perscripts when the statement is true for both algorithms. 

Theorem 5.2 (Convergence rate). Let $\tilde{1}$ be the iteration at which $G$
becomes positive. Then both the margin $\mu(\lambda_t)$ and the value of $G(\lambda_t)$ will be
within $\varepsilon$ of the maximum margin $\rho$ within at most

\[
\tilde{1} + (s_{\tilde{1}} + \ln 2)\, \varepsilon^{-(3-\rho)/(1-\rho)}
\]

iterations, for both Algorithms 1 and 2.

In practice $\rho$ is unknown; this means one cannot use Theorem 5.2 directly
in order to get an explicit numerical upper bound on the number of iterations
required to achieve the given accuracy $\varepsilon$. However, if $R$ is an explicit upper
bound on $\rho$, then the same argument can be used to prove that $g_t$ will exceed
$\rho - \varepsilon$ within at most

\[
\tilde{1} + (s_{\tilde{1}} + \ln 2)\, \varepsilon^{-(3-R)/(1-R)}
\]

iterations. If $R$ is close to $\rho$, this bound becomes tighter. As we iterate, we
can obtain increasingly better upper bounds $R_t$ on $\rho$ as follows: since we
have assumed that the weak learning algorithm produces an edge of at least
$\rho$, that is, $r_\ell \ge \rho$ for all $\ell$, it follows that $R_t := \min_{\ell \le t} r_\ell$ is an upper bound
for $\rho$. $R_t$ is known explicitly at iteration $t$ since the numerical values for all
the $r_\ell$ where $\ell \le t$ are known. We thus obtain, as a corollary to the proof of
Theorem 5.2, the following result, valid for both algorithms.

Corollary 5.1. Let $\tilde{1}$ be the iteration at which $G$ becomes positive. At
any later iteration $t$, if the algorithms are continued for at most

\[
\Delta t := \tilde{1} + (s_{\tilde{1}} + \ln 2)\, \varepsilon^{-(3-R_t)/(1-R_t)} - t
\]

additional iterations, where $R_t = \min_{\ell \le t} r_\ell$, then $g_{t+\Delta t} \in [\rho - \varepsilon, \rho]$.

That is, the value of $G$ will be within $\varepsilon$ of the maximum margin $\rho$ in
at most $\Delta t$ additional iterations. Note that if $\Delta t$ is negative, then we have
already achieved $g_t \in [\rho - \varepsilon, \rho]$.
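To give a feeling for how the running bound tightens as smaller edges are
observed, the sketch below evaluates the expression of Corollary 5.1. The
function names and all numerical inputs are hypothetical and serve only to
illustrate the formula.

```python
import math

def iteration_bound(one_tilde, s_one_tilde, eps, R):
    """Bound of Theorem 5.2 with rho replaced by an upper bound R (rho <= R < 1)."""
    exponent = (3.0 - R) / (1.0 - R)
    return one_tilde + (s_one_tilde + math.log(2.0)) * eps ** (-exponent)

def additional_iterations(t, one_tilde, s_one_tilde, eps, edges_so_far):
    """Delta t of Corollary 5.1, with R_t the smallest edge observed up to t."""
    R_t = min(edges_so_far)
    return iteration_bound(one_tilde, s_one_tilde, eps, R_t) - t

# Hypothetical numbers: G became positive at iteration 7 with s = 3.5.
loose = additional_iterations(20, 7, 3.5, 0.1, [0.6, 0.5, 0.55])
tight = additional_iterations(20, 7, 3.5, 0.1, [0.6, 0.5, 0.55, 1.0 / 3.0])
```

Observing a smaller edge lowers $R_t$ and therefore the exponent
$(3 - R_t)/(1 - R_t)$, so `tight` is smaller than `loose`.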

An important remark is that the technique of proof of Theorem 5.2 is 
much more widely applicable. In fact, we later use this framework to prove 
a convergence rate for arc-gv. The proof used only two main ingredients, 
Lemmas 5.1 and 5.3. Note that AdaBoost itself obeys Lemma 5.3; in fact, 
a bound of the same form can be seen solely from Lemma 5.3 and one 
additional fact, namely, starting from $\lambda_t$, the step size $\alpha_t$ for AdaBoost only
exceeds $\alpha_t^{[1]}$ and $\alpha_t^{[2]}$ by at most a constant, specifically $\frac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right)$. It is the
condition of Lemma 5.1 that AdaBoost does not obey; AdaBoost does not 
make progress with respect to G at each iteration as we discuss in Section 7. 

The convergence rate provided by Theorem 5.2 is not tight; in fact, Algo- 
rithms 1 and 2 often perform at a much faster rate of convergence in practice. 
The fact that the step-size bound in Lemma 5.3 holds for all t allowed us 
to find an upper bound on the number of iterations; however, we can find 
faster convergence rates in the asymptotic regime by using Lemma 5.2 in- 
stead. The following lemma again holds for both Algorithm 1 and Algorithm 
2, and we drop the superscripts. 

Lemma 5.4. For any $0 < \nu < 1/2$, there exists a constant $C_\nu$ such that
for all $t \ge \tilde{1}$,

\[
\rho - g_t \le C_\nu s_t^{-\nu}.
\]

Let us turn this into a convergence rate estimate. Note that the big-oh 
notation in this theorem hides constants that depend on the matrix M. 

Theorem 5.3 (Faster convergence rate). For both Algorithms 1 and 2,
and for any $\delta > 0$, a margin within $\varepsilon$ of optimal is obtained after at most
$O(\varepsilon^{-(3+\delta)})$ iterations from the iteration $\tilde{1}$ where $G$ becomes positive.

Although Theorem 5.3 gives a better convergence rate than Theorem 5.2
[since $3 < (3 - \rho)/(1 - \rho)$], there is a constant factor that is not explicitly
given. Hence, this estimate cannot be translated into an a priori upper bound
on the number of iterations after which $\rho - g_t < \varepsilon$ is guaranteed, unlike
Theorem 5.2 or Corollary 5.1.

From our experiments with Algorithms 1 and 2, we have noticed that 
they converge much faster than predicted (see [25]). This is especially true 
when the edges are large. Nevertheless, the asymptotic convergence rate of 
Theorem 5.3 is sharp in the most extreme nonoptimal case where the weak 
learning algorithm always achieves an edge of $\rho$, as shown in the following
theorem. This theorem is proved for Algorithm 2 only, as it conveys our 
point and eases notation. 

Theorem 5.4 (Convergence rate is sharp). Suppose $r_t = \rho$ for all $t$.
Then, there exists no $C > 0$, $\delta > 0$, $t_0 > 0$ so that $\rho - g_t^{[2]} \le C t^{-(1/3)-\delta}$ for
all $t \ge t_0$. Equivalently, for all $\delta > 0$, $\limsup_{t \to \infty} t^{1+\delta} (\rho - g_t^{[2]})^3 = \infty$, showing
that Algorithm 2 requires at least $\Omega(\varepsilon^{-3})$ iterations to achieve a value of
$g_t^{[2]}$ within $\varepsilon$ of optimal. That is, the convergence rate of Theorem 5.3 is sharp.

6. Convergence of arc-gv. We have finished describing the smooth mar- 
gin algorithms. We will now alter our course; we will use the smooth margin 
function to study well-known algorithms, first arc-gv and then AdaBoost. 
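arc-gv is defined as in Figure 2 except that the update in Step 3d is replaced
by $\alpha_t^{\mathrm{arc}}$,

\[
\alpha_t^{\mathrm{arc}} = \frac{1}{2} \ln\left(\frac{1 + r_t}{1 - r_t}\right) - \frac{1}{2} \ln\left(\frac{1 + \mu_t}{1 - \mu_t}\right),
\qquad \text{where } \mu_t := \mu(\lambda_t).
\]

(Note that we are using Breiman's original formulation of arc-gv, not Meir
and Ratsch's variation.) Note that $\alpha_t^{\mathrm{arc}}$ is nonnegative since $\mu_t \le \rho \le r_t$.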
We directly present a convergence rate for arc-gv; most of the important 
computations for this bound have already been established in the proof of 
Theorem 5.2. As before, we start from when the smooth margin is positive. 
For arc-gv, the smooth margin increases at each iteration (and the mar- 
gin does not necessarily increase). The result we state is weaker than the 
bound for Algorithms 1 and 2, since it is in terms of the maximum margin 
achieved up to time t rather than in terms of the smooth margin at time t. 
However, we note that the smooth margin does increase monotonically, and 
the true margin is never far from the smooth margin as we have shown in 
Proposition 3.1. Here is our guaranteed convergence rate: 

Theorem 6.1 (Convergence rate for arc-gv). Let $\tilde{1}$ be the iteration at
which $G$ becomes positive. Then $\max_{\ell = \tilde{1}, \ldots, t} \mu(\lambda_\ell)$ will be within $\varepsilon$ of the
maximum margin $\rho$ within at most



iterations, for arc-gv. 

The proof is given in Section 9. 

7. A new way to measure AdaBoost's progress. In many ways, Ad- 
aBoost is still a mysterious algorithm. Although it often seems to converge 
to a maximum margin solution (at least in the optimal case), it was shown 
via some optimal case examples that it does not always do so [20, 22]. In 
fact, the difference between the margin produced by AdaBoost and the max- 
imum margin can be quite large; we shall see below that this happens when 
the edges are forced to be somewhat small. These and other results [2, 9, 22] 
suggest that the margin theory only provides a significant piece of the puzzle 
of AdaBoost's strong generalization properties; it is not the whole story. In 
order to understand AdaBoost's strong generalization abilities, it is essen- 
tial to understand how AdaBoost actually constructs its solutions. In this 
section, we make use of new tools to help us understand how AdaBoost 
makes progress. Specifically, we measure the progress of AdaBoost according
to a quantity other than the margin: the smooth margin function
$G$. We focus on two cases: the case where AdaBoost cycles, and the case of
bounded edges, where AdaBoost's edges are required to be bounded strictly 
below 1. These are the only cases for which AdaBoost's convergence is un- 
derstood for separable data. 

First, we show that whenever AdaBoost takes a large step, it makes
progress according to $G$. This result will form the basis of all other results
in this section. We will use the superscript $[A]$ for AdaBoost. Our analysis
makes use of a monotonically increasing function $\Upsilon : (0, 1) \to (0, \infty)$, which
is defined as

\[
\Upsilon(r) := \frac{-\ln(1 - r^2)}{\ln\left(\frac{1 + r}{1 - r}\right)}.
\]

One can show that $\Upsilon$ is monotonically increasing by considering its
derivative. A plot of $\Upsilon$ is shown in Figure 1.

Theorem 7.1 (AdaBoost makes progress if and only if it takes a large
step).

\[
G(\lambda_{t+1}^{[A]}) \ge G(\lambda_t^{[A]}) \iff \Upsilon(r_t) \ge G(\lambda_t^{[A]}).
\]

In other words, $G(\lambda_{t+1}^{[A]}) \ge G(\lambda_t^{[A]})$ if and only if the edge $r_t$ is sufficiently
large.


Fig. 3. Value of the edge at each iteration t, for a run of AdaBoost using a 12 x 25 
matrix M. Whenever G increased from the current iteration to the following iteration, a 
small circle was plotted. Whenever G decreased, a large circle was plotted. The fact that 
the larger circles are below the smaller circles is a direct result of Theorem 7.1. In fact, 
one can visually track the progress of G using the boundary between the larger and smaller 
circles. For further explanation of the interesting dynamics in this plot, see [22]. 



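Proof of Theorem 7.1. Using the expression $\alpha_t^{[A]} = \gamma_t = \tanh^{-1} r_t$
chosen by AdaBoost, the condition for $G$ to increase (or at least stay
constant) is $G(\lambda_t^{[A]}) \le G(\lambda_t^{[A]} + \alpha_t^{[A]} e_{j_t}) = G(\lambda_{t+1}^{[A]})$, which occurs if and only
if

\[
(s_t^{[A]} + \alpha_t^{[A]})\, G(\lambda_t^{[A]}) \le (s_t^{[A]} + \alpha_t^{[A]})\, G(\lambda_{t+1}^{[A]})
= s_t^{[A]} G(\lambda_t^{[A]}) + \int_0^{\alpha_t^{[A]}} \tanh u \, du,
\]

that is,

\[
G(\lambda_t^{[A]}) \le \left(\int_0^{\alpha_t^{[A]}} \tanh u \, du\right) \Big/ \alpha_t^{[A]} = \Upsilon(r_t),
\]

where we have used the recursive equation (4.2) and the fact that $\alpha_t^{[A]}$ is a
function of $r_t$. Thus, our statement is proved. □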



Hence, AdaBoost makes progress (measured by G) if and only if it takes 
a sufficiently large step. Figure 3 illustrates this point. 
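The criterion of Theorem 7.1 can be observed on a small example. In the Python
sketch below, the 3 x 3 matrix `M`, the two starting vectors and the helper
names are assumptions made for illustration. From $\lambda = (1,1,1)$ the smooth
margin lies below $\Upsilon(r_t)$ and one AdaBoost step increases $G$; from
$\lambda = (5,5,5)$ the smooth margin lies above $\Upsilon(r_t)$ and the same step
decreases $G$.

```python
import math

# Hypothetical example; each weak classifier misclassifies one example.
M = [[1, 1, -1],
     [1, -1, 1],
     [-1, 1, 1]]

def upsilon(r):
    """Upsilon(r) = -ln(1 - r^2) / ln((1 + r) / (1 - r)), for 0 < r < 1."""
    return -math.log(1.0 - r * r) / math.log((1.0 + r) / (1.0 - r))

def smooth_margin(lam):
    F = sum(math.exp(-sum(a * b for a, b in zip(row, lam))) for row in M)
    return -math.log(F) / sum(lam)

def adaboost_step(lam):
    """One optimal-case AdaBoost step; returns the new vector and the edge r_t."""
    m, n = len(M), len(M[0])
    w = [math.exp(-sum(a * b for a, b in zip(row, lam))) for row in M]
    F = sum(w)
    edges = [sum(w[i] * M[i][j] for i in range(m)) / F for j in range(n)]
    j = max(range(n), key=lambda k: edges[k])
    new_lam = list(lam)
    new_lam[j] += math.atanh(edges[j])      # alpha_t = gamma_t
    return new_lam, edges[j]

def makes_progress(lam):
    """Return (did G increase?, does Theorem 7.1 predict an increase?)."""
    new_lam, r = adaboost_step(lam)
    increased = smooth_margin(new_lam) >= smooth_margin(lam)
    predicted = upsilon(r) >= smooth_margin(lam)
    return increased, predicted

small = makes_progress([1.0, 1.0, 1.0])   # G below Upsilon(r_t)
large = makes_progress([5.0, 5.0, 5.0])   # G above Upsilon(r_t)
```

In both cases the observed behavior and the prediction agree. As a side check,
$\Upsilon((\sqrt{5} - 1)/2) = 1/3$, since $1 - r^2 = r$ and $(1+r)/(1-r) = r^{-3}$ at that value of $r$.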

7.1. Cyclic AdaBoost and the smooth margin. It has been shown that
AdaBoost's weight vectors $(d_1, d_2, \ldots)$ may converge to a stable periodic
cycle [22]. In fact, the existence of these periodic cycles has already been 
an important tool for proving convergence properties of AdaBoost in the 
optimal case; thus far, they have provided the only nontrivial cases in which 
AdaBoost's convergence can be completely understood. Additionally, they 
have been used to show that AdaBoost may converge to a solution with 
margin significantly below maximum, even in the optimal case. This myste- 
rious and beautiful cyclic behavior for AdaBoost often seems to occur when 
the number of training examples is small, although it has been observed in 
larger cases as well. Since this cycling phenomenon has proven so useful, we 
extend our earlier work [22] in this section. 

While Algorithms 1 and 2 make progress with respect to G at every iter- 
ation, we show that almost the opposite is true for AdaBoost when cycling 
occurs. Namely, we show that AdaBoost cannot increase G at every itera- 
tion except under very special circumstances. For this theorem, we assume 
that AdaBoost is in the process of converging to a cycle, and not necessarily 
on the cycle itself. The edge values on the cycle are denoted $r_1^{\mathrm{cyc}}, \ldots, r_T^{\mathrm{cyc}}$,
where the cycle has length $T$. (E.g., an edge close to $r_1^{\mathrm{cyc}}$ is followed by an
edge close to $r_2^{\mathrm{cyc}}$, an edge close to $r_{T-1}^{\mathrm{cyc}}$ is followed by an edge close to $r_T^{\mathrm{cyc}}$,
which is followed by an edge close to $r_1^{\mathrm{cyc}}$. Note that there are cases where
the limiting edge values $r_1^{\mathrm{cyc}}, \ldots, r_T^{\mathrm{cyc}}$ can be analytically determined from
AdaBoost's dynamical formulas [22]. For our theorem, we do not need to 
assume these values are known, only that they exist.) 

Theorem 7.2 (Cyclic AdaBoost and the smooth margin). Assume Ad- 
aBoost is converging to a cycle of T iterations. Then one of the following 
conditions must be obeyed: 

1. the value of G decreases an infinite number of times, or 

2. the edge values in the cycle $r_1^{\mathrm{cyc}}, \ldots, r_T^{\mathrm{cyc}}$ are equal (i.e.,
$r_1^{\mathrm{cyc}} = \cdots = r_T^{\mathrm{cyc}} = r$ and thus $r_t \to r$), and $G(\lambda_t^{[A]}) \to \Upsilon(r)$ as $t \to \infty$.

Thus, the value of G cannot be strictly increasing except in this very 
special case where AdaBoost's edges, and thus its step sizes, are constant. 
This is in contrast to our new algorithms, which make significant progress 
toward increasing G at each iteration. The proof of Theorem 7.2 can be 
found in Section 10. 

Note that some important previously studied cases fall under the excep- 
tional case 2 of Theorem 7.2 [22]. Hence we now look into case 2 further. In 
case 2, the value of $G$ is nondecreasing, and the values of $r_\ell^{\mathrm{cyc}}$ are identical.
Let us sort the training examples. Within a cycle, for training example $i$,
either $d_{t,i} = 0 \ \forall t$ or $d_{t,i} > 0 \ \forall t$. The examples $i$ such that $d_{t,i} > 0 \ \forall t$ are
support vectors by definition. It can be shown that the support vectors also
attain the same (minimum) margin [22]. It turns out that the support vec- 
tors have a nice property in this case, namely, they are treated equally by 
the weak learning algorithm in the following sense: 

Theorem 7.3 (Cyclic AdaBoost and the smooth margin: exceptional
case). Assume AdaBoost is within a cycle. If all edges in a cycle are the
same, that is, $r_t = r \ \forall t$, then all support vectors are misclassified by the same
number of weak classifiers within the cycle.

Proof. Consider support vectors $i$ and $i'$. Since they are support
vectors, they must obey the cycle condition derived from AdaBoost's dynamical
equations [22, 23], namely: $\prod_{t=1}^T (1 + M_{ij_t} r) = 1$ and $\prod_{t=1}^T (1 + M_{i'j_t} r) = 1$.
Here we have assumed AdaBoost started on the cycle at iteration 1 without
loss of generality. Define $\tau_i := |\{t : 1 \le t \le T, M_{ij_t} = 1\}|$. Here, $\tau_i$
represents the number of times example $i$ is correctly classified during one cycle,
$1 \le \tau_i \le T$.

\[
1 = \prod_{t=1}^T (1 + M_{ij_t} r) = (1 + r)^{\tau_i} (1 - r)^{T - \tau_i} = (1 + r)^{\tau_{i'}} (1 - r)^{T - \tau_{i'}}.
\]

Since $(1 + r)^{\tau} (1 - r)^{T - \tau}$ is strictly increasing in $\tau$ for $0 < r < 1$, it follows
that $\tau_i = \tau_{i'}$. Thus, example $i$ is classified correctly the same number of
times that $i'$ is classified correctly. Since the choice of $i$ and $i'$ was arbitrary,
this holds for all support vectors. □

This theorem shows that a stronger equivalence between support vectors 
exists here; not only do the support vectors achieve the same margin, but 
they are all "viewed" similarly by the weak learning algorithm, in that they 
are misclassified the same proportion of the time. As we have found no 
substantial correlation between the number of support vectors, the number 
of iterations in the cycle, and the number of rows or columns of M, this 
result is somewhat surprising, especially since weak classifiers may appear 
more than once per cycle, so the number of weak classifiers is not even 
directly related to the number of iterations in a cycle. 

Another observation is that even if the value of G is nondecreasing for 
all iterations in the cycle (i.e., the exceptional case we have just discussed),
AdaBoost may not converge to a maximum margin solution, as shown by 
an example analyzed in earlier work [22]. 

7.2. Convergence of AdaBoost with bounded edges. We will now give the 
direct relationship between edge values and margin values promised earlier. 
A special case of this result yields a proof that Ratsch and Warmuth's [17] 
bound on the margin achieved by AdaBoost is tight. This fixes the "gap in 
theory" used as the motivation for the development of AdaBoost*. We will
assume that throughout the run of AdaBoost, our weak classifiers always
have edges within a small interval $[\bar\rho, \bar\rho + \sigma]$ where $\bar\rho \ge \rho$. As $\bar\rho \to \rho$ and
$\sigma \to 0$ we approach the most extreme nonoptimal case. The justification for
allowing a range of possible edge values is practical rather than theoretical;
a weak learning algorithm will probably not be able to achieve an edge of
exactly $\bar\rho$ at every iteration since the number of training examples is finite,
and since the edge is a combinatorial quantity. Thus, we assume only that 
the edge is within a given interval rather than an exact value. Later we will 
give an example to show that we can force this interval to be arbitrarily 
small as long as the number of training examples is large enough. 

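Theorem 7.4 (Convergence of AdaBoost with bounded edges). Assume
that for each $t$, AdaBoost's weak learning algorithm achieves an edge $r_t$ such
that $r_t \in [\bar\rho, \bar\rho + \sigma]$ for some $\rho \le \bar\rho < 1$ and for some $\sigma > 0$. Then,

\[
\limsup_{t \to \infty} g_t^{[A]} \le \Upsilon(\bar\rho + \sigma)
\]

and

\[
\liminf_{t \to \infty} g_t^{[A]} \ge \Upsilon(\bar\rho).
\]

For the special case $\lim_{t \to \infty} r_t = \rho$, this implies

\[
\lim_{t \to \infty} g_t^{[A]} = \lim_{t \to \infty} \mu(\lambda_t^{[A]}) = \Upsilon(\rho).
\]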

This result gives an explicit small range for the margin $\mu(\lambda_t^{[A]})$, since from
(3.2) and $\|\lambda_t^{[A]}\|_1 \to \infty$ as $t \to \infty$, we have $\lim_{t \to \infty} (g_t^{[A]} - \mu(\lambda_t^{[A]})) = 0$. (The
divergence $\|\lambda_t^{[A]}\|_1 \to \infty$ always occurs for AdaBoost in the separable
case since the edge is bounded away from zero.) The special case $\lim_{t \to \infty} r_t = \rho$
shows the tightness of the bound of Ratsch and Warmuth [17] (see [15] for
the proof). Their result, which we summarize only for AdaBoost rather than
for the slightly more general AdaBoost$_\varrho$, states that $\liminf_{t \to \infty} \mu(\lambda_t^{[A]}) \ge
\Upsilon(r_{\inf})$, where $r_{\inf} = \inf_t r_t$. (The statement of their theorem seems to assume
the existence of a combined hypothesis and limiting margin, but we believe
these strong assumptions are not necessary, and that their proof of the lower
bound holds without these assumptions.) Theorem 7.4 gives bounds from
both above and below, so we now have a much more explicit convergence
property of the margin. The proof can be found in Section 10.

Our next result is that Theorem 7.4 can be realized even for arbitrarily 
small interval size $\sigma$. In other words, AdaBoost can achieve any margin with
arbitrarily high accuracy; that is, for a given margin value and precision, we 
can construct a training set and weak learning algorithm where AdaBoost 
attains that margin with that precision. 

Fig. 4. AdaBoost's probability of error on test data decreases as the margin increases.
We computed nine trials, namely, eight trials of nonoptimal AdaBoost, $\ell = 1, \ldots, 8$, and
one trial of optimal AdaBoost (denoted via $\ell = 0$). For each nonoptimal trial $\ell$, a goal
edge value $r_\ell$ was manually prespecified. For 3,000 iterations of each trial, we stored the
edge values $r_{\ell,t}$ and margins $\mu_{\ell,t}$ on the training set, along with the probability of error
on a randomly chosen test set $e_{\ell,t}$. A: edge versus margin. In each of the nine trials,
we plot $(\mu_{\ell,t}, r_{\ell,t})$ for iterations $t$ that fall within the plot domain. Later iterations tend
to give points nearer to the right in the plot. Additionally, dots have been placed at the
points $(\Upsilon(r_\ell), r_\ell)$ for $\ell = 1, \ldots, 8$. By Theorem 7.4, the asymptotic margin value for trial
$\ell$ should be approximately $\Upsilon(r_\ell)$. Thus, AdaBoost's margins $\mu_{\ell,t}$ are converging to the
prespecified margins $\Upsilon(r_\ell)$. B: probability of error versus margins. The lower scattered
curve represents optimal AdaBoost; for optimal AdaBoost, we have plotted all $(\mu_{0,t}, e_{0,t})$
pairs falling within the plot domain. For clarity, we plot only the last 250 iterations for
each nonoptimal trial, that is, for trial $\ell$, there is a clump of 250 points $(\mu_{\ell,t}, e_{\ell,t})$ with
margin values $\mu_{\ell,t} \approx \Upsilon(r_\ell)$. This plot shows that the probability of error decreases as the
prespecified margin increases. C: edges $r_{0,t}$ (top curve), margins $\mu_{0,t}$ (middle curve) and
smooth margins (lower curve) versus number of iterations $t$ for only the optimal AdaBoost
trial.

Theorem 7.5 (Bound of Theorem 7.4 is nonvacuous). Say we are given
$0 < \bar\rho < 1$ and $\sigma > 0$ arbitrarily small. Then there is some matrix $M$ for which
nonoptimal AdaBoost may choose an infinite sequence of weak classifiers
with edge values in the interval $[\bar\rho, \bar\rho + \sigma]$. Additionally for this matrix $M$,
we have $\bar\rho \ge \rho$ (where $\rho$ is the maximum margin for $M$).

The proof, in Section 10, is by explicit construction, in which the num- 
ber of examples and weak classifiers increases as more precise bounds are 
required, that is, as the precision width parameter $\sigma$ decreases.

Let us see Theorem 7.4 in action. Now that one can more or less predeter- 
mine the value of AdaBoost's margin simply by choosing the edge values to 
be within a small range, one might again consider the important question of 
whether AdaBoost's asymptotic margin matters for generalization. To study 
this empirically, we use AdaBoost only, several times on the same data set 
with the same set of weak classifiers. Our results show that the choice of 
edge value (and thus the asymptotic margin) does have a dramatic effect on 
the test error. Artificial test data for Figure 4 was designed as follows: 300
examples were constructed randomly such that each $x_i$ lies on a corner of
the hypercube $\{-1, 1\}^{800}$. The labels are: $y_i = \mathrm{sign}(\sum_{k=1} x_i(k))$, where $x_i(k)$
indicates the $k$th component of $x_i$. For $j = 1, \ldots, 800$, the $j$th weak
classifier is $h_j(x) = x(j)$, thus $M_{ij} = y_i x_i(j)$. For $801 \le j \le 1600$, $h_j = -h_{(j-800)}$.
There were 10,000 identically distributed randomly generated examples used
for testing. The hypothesis space must be the same for each trial as a con- 
trol; we purposely did not restrict the space via regularization (e.g., norm 
regulation, early stopping, or pruning). Hence we have a controlled experi- 
ment where only the choice of weak classifier is different, and this directly 
determines the margin via Theorem 7.4. AdaBoost was run nine times on
this dataset, each time for $t_{\max} = 3{,}000$ iterations, the first time with
standard optimal-case AdaBoost, and eight times with nonoptimal AdaBoost.
For each nonoptimal trial, we selected a "goal" edge value $r_{\mathrm{goal}}$ (the eight
goal edge values were equally spaced). The weak learning algorithm chooses
the closest possible edge to that goal. In this way, AdaBoost's margin is
close to $\Upsilon(r_{\mathrm{goal}})$. The results are shown in Figure 4B, which shows test
error versus margins for the asymptotic regime of optimal AdaBoost (lower
scattered curve) and the last 250 iterations for each nonoptimal trial (the 
eight clumps, each containing 250 points). It is very clear that as the margin 
increases, the probability of error decreases, and optimal AdaBoost has the 
lowest probability of error. 

Note that the asymptotic margin is not the whole story; optimal Ad- 
aBoost yields a lower probability of error even before the asymptotic regime 
was reached. Thus, it is the degree of "optimal-ness" of the weak learn- 
ing algorithm (directly controlling the asymptotic margin) that is inversely 
correlated with the probability of error for AdaBoost. 

Now that we have finished describing the results, we move on to the proofs. 
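
8. Proof of Proposition 3.1. To show property 1 given assumptions on
$M$, we will compute an arbitrary element of the Hessian $H$,

\[
\begin{aligned}
H_{kj} = \frac{\partial^2 G(\lambda)}{\partial\lambda_k \, \partial\lambda_j}
= {} & -\frac{1}{F(\lambda) \|\lambda\|_1} \frac{\partial^2 F(\lambda)}{\partial\lambda_k \, \partial\lambda_j}
+ \frac{1}{F(\lambda) \|\lambda\|_1^2} \left[\frac{\partial F(\lambda)}{\partial\lambda_k} + \frac{\partial F(\lambda)}{\partial\lambda_j}\right] \\
& - \frac{2 \ln F(\lambda)}{\|\lambda\|_1^3}
+ \frac{1}{F(\lambda)^2 \|\lambda\|_1} \frac{\partial F(\lambda)}{\partial\lambda_j} \frac{\partial F(\lambda)}{\partial\lambda_k}.
\end{aligned}
\]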

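For $G$ to be concave, we need $w^T H w \le 0$ for all vectors $w$. We are
considering the case where $w$ obeys $\sum_j w_j = 0$ so we are considering only directions
in which $\|\lambda\|_1$ does not change. Thus, we are showing that $G$ is concave on
every "shell." Note that
$\sum_{j,k} w_j w_k \frac{\partial F(\lambda)}{\partial\lambda_k} = \left(\sum_k w_k \frac{\partial F(\lambda)}{\partial\lambda_k}\right)\left(\sum_j w_j\right) = 0$, and
thus

\[
\begin{aligned}
\sum_{j,k} w_j w_k H_{kj}
&= \frac{1}{F(\lambda)^2 \|\lambda\|_1} \left[ -F(\lambda) \sum_{j,k} w_j w_k \frac{\partial^2 F(\lambda)}{\partial\lambda_k \, \partial\lambda_j}
+ \left(\sum_j w_j \frac{\partial F(\lambda)}{\partial\lambda_j}\right)^2 \right] \\
&= \frac{1}{F(\lambda)^2 \|\lambda\|_1} \left[ -\left(\sum_i e^{-(M\lambda)_i}\right) \left(\sum_i (Mw)_i^2 e^{-(M\lambda)_i}\right)
+ \left(\sum_i (Mw)_i e^{-(M\lambda)_i}\right)^2 \right].
\end{aligned}
\tag{8.1}
\]

Let the vectors $\Psi_1$ and $\Psi_2$ be defined as $\Psi_{1,i} := (Mw)_i e^{-(M\lambda)_i/2}$ and
$\Psi_{2,i} := e^{-(M\lambda)_i/2}$. The Cauchy-Schwarz inequality applied to $\Psi_1$ and $\Psi_2$ gives

\[
\left(\sum_i \Psi_{1,i} \Psi_{2,i}\right)^2 - \left(\sum_i \Psi_{1,i}^2\right) \left(\sum_i \Psi_{2,i}^2\right) \le 0.
\]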



Since this expression is identical to the one bracketed in (8.1), J2j,k w j w kHkj < 
0, and thus we have shown that the function G{X) is concave on each shell, 
but not strictly. Equality in the Cauchy-Schwarz equation is achieved only 
when is parallel to \l/2) that is, when (Mw)j does not depend on i. There 
are some matrices where such a w exists, for example, the matrix 

-1111 
A= I 1 -1 1 -1 
1 1 -1 -1 
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with vector w = (—^c,c,c,—^c), where cGR. Here, (Mw)j = c for all i. 



We have shown that the function $G$ is concave for each "shell," but not
necessarily strictly concave. (One can find out whether $G$ is strictly
concave on each shell for a particular matrix $M$ by solving $Mw = c\mathbf{1}$
subject to $\sum_j w_j = 0$, which can be added as a row.) We have now
finished the proof of property 1.
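
The non-strictness example can be checked numerically. The sketch below is only an illustration: it assumes the direction has the form $w = (-c/2, c, c, -3c/2)$ (obtained by solving $Mw = c\mathbf{1}$, $\sum_j w_j = 0$ for the matrix as printed), and the function names are hypothetical helpers, not part of the paper. It verifies that $w$ sums to zero, that $(Mw)_i = c$ for every $i$, and that the bracketed Cauchy-Schwarz expression in (8.1) then vanishes for an arbitrary $\lambda$.

```python
import math
from fractions import Fraction

# Example matrix: rows are training examples, columns are weak classifiers.
M = [[-1, 1, 1, 1],
     [1, -1, 1, -1],
     [1, 1, -1, -1]]


def flat_direction(c):
    """Assumed direction w = (-c/2, c, c, -3c/2), kept exact with Fractions."""
    c = Fraction(c)
    return [-c / 2, c, c, -3 * c / 2]


def mat_vec(matrix, vec):
    """Plain matrix-vector product."""
    return [sum(a * v for a, v in zip(row, vec)) for row in matrix]


def cauchy_schwarz_gap(matrix, lam, w):
    """Bracketed term of (8.1):
    (sum_i (Mw)_i e_i)^2 - (sum_i e_i)(sum_i (Mw)_i^2 e_i), with e_i = exp(-(M lam)_i)."""
    e = [math.exp(-x) for x in mat_vec(matrix, lam)]
    mw = [float(x) for x in mat_vec(matrix, w)]
    cross = sum(a * b for a, b in zip(mw, e))
    return cross ** 2 - sum(e) * sum(a * a * b for a, b in zip(mw, e))
```

Because $(Mw)_i$ is constant along this direction, the gap is zero up to rounding, which is exactly the case of equality in the Cauchy-Schwarz step.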

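To show property 2, we compute the derivative in the radial direction,
$dG(\lambda(1+a))/da\,|_{a=0}$, and show that it is positive. We find, using
the notation $d_i := e^{-(M\lambda)_i}/F(\lambda)$,

\[
\begin{aligned}
\frac{dG(\lambda(1+a))}{da}\Big|_{a=0}
&= \frac{1}{\|\lambda\|_1}\Big[\sum_{i=1}^m d_i (M\lambda)_i + \ln F(\lambda)\Big] \\
&\ge \frac{1}{\|\lambda\|_1}\Big[\Big(\sum_{i=1}^m d_i\Big)\min_i (M\lambda)_i
   + \ln\Big(\sum_{i=1}^m e^{-(M\lambda)_i}\Big)\Big] \\
&> \frac{1}{\|\lambda\|_1}\Big[\min_i (M\lambda)_i + \ln e^{-\min_i (M\lambda)_i}\Big] = 0.
\end{aligned}
\]

The very last inequality follows since from our $m > 1$ terms, we took only
one term, and also since $\sum_i d_i = 1$.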



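9. Convergence proofs. Before we state the proofs, we must continue our
simplification of the recursive equations. From the recursive equation for $G$,
namely (4.2) applied to Algorithm 1,

\[
\begin{aligned}
s^{[1]}_{t+1} g^{[1]}_{t+1} - s^{[1]}_t g^{[1]}_t
&= \ln\frac{\cosh\gamma_t}{\cosh(\gamma_t - \alpha^{[1]}_t)}
 = \frac12\ln\frac{1-\tanh^2(\gamma_t-\alpha^{[1]}_t)}{1-\tanh^2\gamma_t} \\
&= \frac12\ln\frac{[1-\tanh(\gamma_t-\alpha^{[1]}_t)]\,[1+\tanh(\gamma_t-\alpha^{[1]}_t)]}
                  {(1-\tanh\gamma_t)(1+\tanh\gamma_t)} \\
&= \frac12\ln\frac{(1-g^{[1]}_{t+1})(1+g^{[1]}_{t+1})}{(1-r_t)(1+r_t)}
 = \alpha^{[1]}_t + \ln\frac{1+g^{[1]}_{t+1}}{1+r_t}.
\end{aligned}
\tag{9.1}
\]

Here we have used both (4.3) and (4.4). We perform an analogous
simplification for Algorithm 2. Starting from (4.2) and applying (4.5) and (4.6),

\[
s^{[2]}_{t+1} g^{[2]}_{t+1} - s^{[2]}_t g^{[2]}_t
= \frac12\ln\frac{(1-g^{[2]}_t)(1+g^{[2]}_t)}{(1-r_t)(1+r_t)}
= \alpha^{[2]}_t + \ln\frac{1+g^{[2]}_t}{1+r_t}.
\tag{9.2}
\]

We will use equations (9.1) and (9.2) to help us with the proofs.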
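
As a numerical sanity check of the simplification for Algorithm 2, the following sketch compares $\ln[\cosh\gamma_t/\cosh(\gamma_t-\alpha_t)]$ with $\alpha_t + \ln[(1+g_t)/(1+r_t)]$. It assumes, as the proof of Lemma 5.1 indicates, that (4.5) reads $\tanh(\gamma_t - \alpha_t^{[2]}) = g_t^{[2]}$ with $\gamma_t = \tanh^{-1} r_t$; the function names are illustrative only.

```python
import math


def alpha_alg2(r, g):
    """Step size implied by tanh(gamma - alpha) = g with gamma = atanh(r)."""
    return math.atanh(r) - math.atanh(g)


def lhs_log_cosh(r, g):
    """ln[cosh(gamma) / cosh(gamma - alpha)], the raw form of the recursion."""
    gamma = math.atanh(r)
    return math.log(math.cosh(gamma) / math.cosh(gamma - alpha_alg2(r, g)))


def rhs_simplified(r, g):
    """alpha + ln[(1 + g) / (1 + r)], the simplified form in (9.2)."""
    return alpha_alg2(r, g) + math.log((1 + g) / (1 + r))
```

For any $0 < g < r < 1$ the two functions should agree up to floating-point error.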
Proof of Lemma 5.1. We start with Algorithm 2. First, we note that
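since the function tanh is concave on $\mathbb{R}_+$, we can lower bound tanh
on an interval $(a, b) \subset (0, \infty)$ by the line connecting the points
$(a, \tanh(a))$ and $(b, \tanh(b))$. Thus,

\[
\int_{\gamma_t-\alpha^{[2]}_t}^{\gamma_t} \tanh u\, du
\ge \frac12 \alpha^{[2]}_t\,[\tanh\gamma_t + \tanh(\gamma_t - \alpha^{[2]}_t)]
= \frac12 \alpha^{[2]}_t\,(r_t + g^{[2]}_t),
\tag{9.3}
\]

where the last equality is from (4.5). Combining (9.3) with (4.2) yields

\[
s^{[2]}_{t+1}\,(g^{[2]}_{t+1} - g^{[2]}_t) + \alpha^{[2]}_t g^{[2]}_t
\ge \frac12 \alpha^{[2]}_t\,(r_t + g^{[2]}_t),
\qquad
g^{[2]}_{t+1} - g^{[2]}_t \ge \frac{\alpha^{[2]}_t\,(r_t - g^{[2]}_t)}{2 s^{[2]}_{t+1}}.
\]

Thus, the statement of the lemma holds for Algorithm 2. By definition,
$g^{[1]}_{t+1}$ is the maximum value of $G(\lambda_t + \alpha e_{j_t})$, so
$g^{[1]}_{t+1} \ge g^{[2]}_{t+1}$. By (4.4) and (4.6), we know
$\alpha^{[1]}_t \le \alpha^{[2]}_t$. Because $\alpha/(s+\alpha) = 1 - s/(\alpha+s)$
increases with $\alpha$,

\[
g^{[1]}_{t+1} - g_t \ge g^{[2]}_{t+1} - g_t
\ge \frac{\alpha^{[2]}_t\,(r_t - g_t)}{2\,(s_t + \alpha^{[2]}_t)}
\ge \frac{\alpha^{[1]}_t\,(r_t - g_t)}{2\,(s_t + \alpha^{[1]}_t)}.
\]

Thus, we have completed the proof of Lemma 5.1. □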
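
The key step in this proof is the chord bound for the concave function tanh. A minimal sketch of that bound, with hypothetical helper names, compares the exact integral $\int_a^b \tanh u\,du = \ln\cosh b - \ln\cosh a$ with the area under the chord joining $(a,\tanh a)$ and $(b,\tanh b)$:

```python
import math


def tanh_integral(a, b):
    """Exact integral of tanh over [a, b], i.e. ln cosh(b) - ln cosh(a)."""
    return math.log(math.cosh(b)) - math.log(math.cosh(a))


def chord_area(a, b):
    """Area under the straight line from (a, tanh a) to (b, tanh b)."""
    return 0.5 * (b - a) * (math.tanh(a) + math.tanh(b))
```

For $0 \le a < b$ the exact integral should never fall below the chord area, which is the inequality used in (9.3) with $a = \gamma_t - \alpha_t$ and $b = \gamma_t$.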



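Proof of Lemma 5.2. The proof holds for both algorithms, so we have
dropped the superscripts. There are two possibilities; either
$\lim_{t\to\infty} s_t = \infty$ or $\lim_{t\to\infty} s_t < \infty$. We handle
these cases separately, starting with the case $\lim_{t\to\infty} s_t = \infty$.
From (9.1) and (9.2), and recalling that $g_t \le g_{t+1} \le \rho \le r_t$,
we know

\[
s_{t+1} g_{t+1} - s_t g_t \ge \alpha_t + \ln\frac{1+g_t}{1+r_t},
\]

so that

\[
\alpha_t (1-\rho) \le \alpha_t (1 - g_{t+1}) \le s_t (g_{t+1} - g_t) + \ln\frac{1+r_t}{1+g_t}.
\]

We denote by $\tilde 1$ the first iteration where $G$ is positive, so
$g_{\tilde 1} > 0$. Dividing by $(1-\rho)s_t$, recalling that $r_t \le 1$ and
$g_{\tilde 1} \le g_t$,

\[
\frac{\alpha_t}{s_{t+1}} \le \frac{\alpha_t}{s_t}
\le \frac{g_{t+1}-g_t}{1-\rho} + \frac{1}{1-\rho}\,\frac{1}{s_t}\,\ln\frac{1+r_t}{1+g_t}
\le \frac{g_{t+1}-g_t}{1-\rho} + \frac{1}{1-\rho}\,\frac{1}{s_t}\,\ln\frac{2}{1+g_{\tilde 1}}.
\]

We will take the limit of both sides as $t \to \infty$. Since the values $g_t$
are monotonically increasing and are bounded by 1,
$\lim_{t\to\infty}(g_{t+1} - g_t) = 0$. Hence, the first term vanishes in the
limit. Since $\lim_{t\to\infty} s_t = \infty$, the second term also vanishes in
the limit. Thus, the statement of the lemma holds when $s_t \to \infty$.

Now for the case where $\lim_{t\to\infty} s_t < \infty$, consider

\[
\sum_{t=\tilde 1}^{T} \frac{\alpha_t}{s_{t+1}}
= \sum_{t=\tilde 1}^{T} \frac{s_{t+1}-s_t}{s_{t+1}}
= \sum_{t=\tilde 1}^{T} \int_{s_t}^{s_{t+1}} \frac{1}{s_{t+1}}\,du
\le \sum_{t=\tilde 1}^{T} \int_{s_t}^{s_{t+1}} \frac{1}{u}\,du
= \ln\frac{s_{T+1}}{s_{\tilde 1}}.
\]

By our assumption that $\lim_{t\to\infty} s_t < \infty$, the above sequence is a
bounded increasing sequence. Thus, $\sum_{t=\tilde 1}^{\infty} \alpha_t/s_{t+1}$
converges. In particular, $\lim_{t\to\infty} \alpha_t/s_{t+1} = 0$. □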

Proof of Theorem 5.1. We choose to show convergence from the starting
position $\lambda_{\tilde 1}$, where $\lambda_{\tilde 1}$ is the coefficient
vector at the first iteration where $G$ is positive. This is the iteration
where we switch from AdaBoost to our new iteration scheme; it suffices to show
convergence from this point. For this proof, we drop the superscripts
$^{[1]}$ and $^{[2]}$; each step in the proof holds for both algorithms.

The values of $g_t$ constitute a nondecreasing sequence that is uniformly
bounded by 1. Thus, a limit $g_\infty$ must exist,
$g_\infty := \lim_{t\to\infty} g_t$. By (3.2), we know that $g_t \le \rho$ for
all $t$. Thus, $g_\infty \le \rho$. Let us suppose that $g_\infty < \rho$, that
is, that $\rho - g_\infty \ne 0$. (We will show this assumption is not true by
contradiction.)

From Lemma 5.2, there exists a time $t_0 \in \mathbb{N}$ such that, for all
times $t \ge t_0$, we have $\alpha_t/s_{t+1} \le 1/2$, or equivalently,
$\alpha_t \le s_{t+1}/2$, and thus $s_t = s_{t+1} - \alpha_t \ge s_{t+1}/2$, so
that

\[
\frac{\alpha_t}{s_t} \le \frac{2\alpha_t}{s_{t+1}} \quad\text{for } t \ge t_0.
\tag{9.4}
\]

From Lemma 5.1, since $g_t \le g_\infty$ and $r_t \ge \rho$, we have

\[
(\rho - g_\infty)\,\frac{\alpha_t}{2 s_{t+1}}
\le \frac{\alpha_t\,(r_t - g_t)}{2 s_{t+1}} \le g_{t+1} - g_t.
\]

Thus, for all $T \in \mathbb{N}$,

\[
(\rho - g_\infty) \sum_{t=\tilde 1}^{T} \frac{\alpha_t}{2 s_{t+1}}
\le \sum_{t=\tilde 1}^{T} (g_{t+1} - g_t) = g_{T+1} - g_{\tilde 1} < 1.
\tag{9.5}
\]

Under our assumption $\rho - g_\infty \ne 0$, the inequality (9.5) implies that
the series $\sum_{t=\tilde 1}^{\infty} (\alpha_t/s_{t+1})$ converges. This,
combined with (9.4), implies that the series
$\sum_{t=\tilde 1}^{\infty} (\alpha_t/s_t)$ converges, since its tail is
majorized, term by term, by the tail of a converging series. Therefore, for all
$T \in \mathbb{N}$, $T > \tilde 1$,

\[
\infty > \sum_{t=\tilde 1}^{\infty} \frac{\alpha_t}{s_t}
\ge \sum_{t=\tilde 1}^{T-1} \frac{\alpha_t}{s_t}
= \sum_{t=\tilde 1}^{T-1} \frac{s_{t+1}-s_t}{s_t}
= \sum_{t=\tilde 1}^{T-1} \int_{s_t}^{s_{t+1}} \frac{1}{s_t}\,du
\ge \sum_{t=\tilde 1}^{T-1} \int_{s_t}^{s_{t+1}} \frac{1}{u}\,du
= \int_{s_{\tilde 1}}^{s_T} \frac{1}{u}\,du = \ln s_T - \ln s_{\tilde 1}.
\]

Therefore, the $s_t$ constitute a bounded, increasing sequence and must
converge; define $s_\infty := \lim_{t\to\infty} s_t < \infty$. The convergence
of the $s_t$ sequence implies that $\alpha_t = s_{t+1} - s_t$ must converge to
zero: $\lim_{t\to\infty} \alpha_t = 0$. Finally, we use the fact that tanh is
continuous and strictly increasing, together with (4.3) and (4.5), to derive

\[
\begin{aligned}
g_\infty = \lim_{t\to\infty} g_t = \liminf_{t\to\infty} g_t
&= \tanh\Big[\liminf_{t\to\infty} (\gamma_t - \alpha_t)\Big]
 = \tanh\Big[\liminf_{t\to\infty} \gamma_t - \lim_{t\to\infty} \alpha_t\Big] \\
&= \tanh\Big[\liminf_{t\to\infty} \gamma_t\Big]
 = \liminf_{t\to\infty}\,[\tanh\gamma_t]
 = \liminf_{t\to\infty} r_t \ge \rho.
\end{aligned}
\]

This is a contradiction with the original assumption that $g_\infty < \rho$. It
follows that we have proved that $g_\infty = \rho$, or
$\lim_{t\to\infty} (\rho - g_t) = 0$. □



Proof of Lemma 5.3. The proof works for both algorithms, so we
leave off the superscripts. From (4.2),

\[
s_{t+1} g_{t+1} - s_t g_t = \ln\cosh\gamma_t - \ln\cosh(\gamma_t - \alpha_t).
\tag{9.6}
\]

Because $\tfrac12 e^{\xi} \le \tfrac12 (e^{\xi} + e^{-\xi}) = \cosh\xi \le e^{\xi}$
for $\xi > 0$, we have $\xi - \ln 2 \le \ln\cosh\xi \le \xi$. Combining this
with (9.6),

\[
s_{t+1} g_{t+1} - s_t g_t \ge \gamma_t - \ln 2 - (\gamma_t - \alpha_t) = \alpha_t - \ln 2,
\]

so

\[
\alpha_t (1 - \rho) \le \alpha_t (1 - g_{t+1}) \le \ln 2 + s_t (g_{t+1} - g_t) \le \ln 2 + \rho s_t.
\]

The first and last inequalities of the last line use the fact that $G$ is
positive and bounded by $\rho$, that is, $1 - \rho \le 1 - g_{t+1}$ and
$g_{t+1} - g_t \le \rho$. Thus, dividing both sides by $(1 - \rho)$, we find the
statement of the lemma. □



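Proof of Theorem 5.2. Again the superscripts have been removed since all
statements are true for both algorithms. Define
$\Delta G(\lambda) := \rho - G(\lambda)$. Since (3.2) states that
$g_t < \mu(\lambda_t)$, we know
$0 \le \rho - \mu(\lambda_t) < \rho - g_t = \Delta G(\lambda_t)$, and thus we
need only to control how fast $\Delta G(\lambda_t) \to 0$ as $t \to \infty$.
That is, if $g_t$ is within $\epsilon$ of the maximum margin $\rho$, so is the
margin $\mu(\lambda_t)$. Starting from Lemma 5.1,

\[
\rho - g_{t+1} \le \rho - g_t - \frac{\alpha_t}{2 s_{t+1}}\,(r_t - \rho + \rho - g_t),
\]

thus

\[
\begin{aligned}
\Delta G(\lambda_{t+1})
&\le \Delta G(\lambda_t)\Big[1 - \frac{\alpha_t}{2 s_{t+1}}\Big] - \frac{\alpha_t (r_t - \rho)}{2 s_{t+1}} \\
&\le \Delta G(\lambda_t)\Big[1 - \frac{\alpha_t}{2 s_{t+1}}\Big]
 \le \Delta G(\lambda_{\tilde 1}) \prod_{\ell=\tilde 1}^{t}\Big[1 - \frac{\alpha_\ell}{2 s_{\ell+1}}\Big].
\end{aligned}
\tag{9.7}
\]

Here, the second inequality is due to the restriction $r_t \ge \rho$ and the
fact that $\alpha_t > 0$. The last inequality of (9.7) is from the recursion.
We stop the recursion at $\lambda_{\tilde 1}$, where $\lambda_{\tilde 1}$ is
the coefficient vector at the first iteration where $G$ is positive. Before we
continue, we upper bound the product in (9.7),

\[
\begin{aligned}
\prod_{\ell=\tilde 1}^{t}\Big[1 - \frac{\alpha_\ell}{2 s_{\ell+1}}\Big]
&= \prod_{\ell=\tilde 1}^{t}\Big[1 - \frac12\,\frac{s_{\ell+1}-s_\ell}{s_{\ell+1}}\Big]
 \le \exp\Big[-\frac12 \sum_{\ell=\tilde 1}^{t} \frac{s_{\ell+1}-s_\ell}{s_{\ell+1}}\Big] \\
&\le \exp\Big[-\frac12 \sum_{\ell=\tilde 1}^{t}
   \frac{s_{\ell+1}-s_\ell}{s_\ell + \frac{\rho}{1-\rho}\,s_\ell + \frac{\ln 2}{1-\rho}}\Big]
 = \exp\Big[-\frac{1-\rho}{2} \sum_{\ell=\tilde 1}^{t} \frac{s_{\ell+1}-s_\ell}{s_\ell + \ln 2}\Big] \\
&\le \exp\Big[-\frac{1-\rho}{2} \int_{s_{\tilde 1}}^{s_{t+1}} \frac{dv}{v + \ln 2}\Big]
 = \Big[\frac{s_{\tilde 1} + \ln 2}{s_{t+1} + \ln 2}\Big]^{(1-\rho)/2}.
\end{aligned}
\tag{9.8}
\]

Here, the first line holds since $1 - x \le e^{-x}$ for all $x$, and the next
line follows from our bound on the size of $\alpha_t$ in Lemma 5.3. Plugging
back into (9.7), it follows that

\[
\Delta G(\lambda_t) \le \Delta G(\lambda_{\tilde 1})
\Big[\frac{s_{\tilde 1} + \ln 2}{s_t + \ln 2}\Big]^{(1-\rho)/2},
\]

or

\[
s_t \le s_t + \ln 2 \le (s_{\tilde 1} + \ln 2)
\Big[\frac{\Delta G(\lambda_{\tilde 1})}{\Delta G(\lambda_t)}\Big]^{2/(1-\rho)}.
\tag{9.9}
\]

On the other hand, we have (for Algorithm 2)

\[
\begin{aligned}
\alpha^{[2]}_t \ge \tanh\alpha^{[2]}_t
&= \tanh[\gamma_t - (\gamma_t - \alpha^{[2]}_t)]
 = \frac{\tanh\gamma_t - \tanh(\gamma_t - \alpha^{[2]}_t)}{1 - \tanh\gamma_t\,\tanh(\gamma_t - \alpha^{[2]}_t)} \\
&= \frac{r_t - g^{[2]}_t}{1 - r_t g^{[2]}_t}
 \ge \frac{\rho - g^{[2]}_t}{1 - \rho g^{[2]}_t}
 = \frac{\Delta G(\lambda^{[2]}_t)}{1 - \rho g^{[2]}_t}
 \ge \frac{\Delta G(\lambda^{[2]}_{t+1})}{1 - \rho g^{[2]}_{\tilde 1}}.
\end{aligned}
\]

A similar calculation for Algorithm 1 holds. Thus, for both algorithms we have
$\alpha_t \ge \Delta G(\lambda_{t+1})/(1 - \rho g_{\tilde 1})$, which implies

\[
s_{t+1} = s_{\tilde 1} + \sum_{\ell=\tilde 1}^{t} \alpha_\ell
\ge s_{\tilde 1} + \sum_{\ell=\tilde 1}^{t} \frac{\Delta G(\lambda_{\ell+1})}{1 - \rho g_{\tilde 1}}
\ge s_{\tilde 1} + (t - \tilde 1 + 1)\,\frac{\Delta G(\lambda_{t+1})}{1 - \rho g_{\tilde 1}}.
\tag{9.10}
\]

Combining (9.9) with (9.10) leads to

\[
t - \tilde 1 \le \frac{(1 - \rho g_{\tilde 1})\, s_t}{\Delta G(\lambda_t)}
\le \frac{(1 - \rho g_{\tilde 1})(s_{\tilde 1} + \ln 2)\,[\Delta G(\lambda_{\tilde 1})]^{2/(1-\rho)}}
         {[\Delta G(\lambda_t)]^{1 + [2/(1-\rho)]}}
\le \frac{s_{\tilde 1} + \ln 2}{[\Delta G(\lambda_t)]^{(3-\rho)/(1-\rho)}},
\]

where we have used that $(1 - \rho g_{\tilde 1}) \le 1$ and
$\Delta G(\lambda_{\tilde 1}) \le 1$. This means that
$\Delta G(\lambda_t) \ge \epsilon$ is possible only if
$t \le \tilde 1 + (s_{\tilde 1} + \ln 2)\,\epsilon^{-(3-\rho)/(1-\rho)}$.
Therefore, if $t$ exceeds
$\tilde 1 + (s_{\tilde 1} + \ln 2)\,\epsilon^{-(3-\rho)/(1-\rho)}$, it follows
that $\Delta G(\lambda_t) < \epsilon$. This concludes the proof of Theorem 5.2. □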
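
To get a feel for the size of this bound, the following sketch evaluates $\tilde 1 + (s_{\tilde 1} + \ln 2)\,\epsilon^{-(3-\rho)/(1-\rho)}$ for given inputs. The function and parameter names are hypothetical; `t_start` and `s_start` stand for $\tilde 1$ and $s_{\tilde 1}$, which in practice would come from the run of the algorithm.

```python
import math


def iterations_to_epsilon(s_start, rho, eps, t_start=1):
    """Evaluate t_start + (s_start + ln 2) * eps**(-(3 - rho) / (1 - rho)).

    Beyond this many iterations the proof guarantees rho - g_t < eps.
    """
    if not (0.0 <= rho < 1.0):
        raise ValueError("rho must lie in [0, 1)")
    if not (0.0 < eps <= 1.0):
        raise ValueError("eps must lie in (0, 1]")
    exponent = (3.0 - rho) / (1.0 - rho)
    return t_start + (s_start + math.log(2.0)) * eps ** (-exponent)
```

The exponent $(3-\rho)/(1-\rho)$ equals 3 when $\rho = 0$ and grows without bound as $\rho \to 1$, so the guarantee weakens for problems with a large maximum margin.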

Proof of Lemma 5.4. We show that there is a $T_\nu$ such that after
iteration $T_\nu$, $s_t^{\nu}(\rho - g_t)$ is a decreasing sequence,

\[
s_{t+1}^{\nu}(\rho - g_{t+1}) \le s_t^{\nu}(\rho - g_t) \quad\text{for } t \ge T_\nu.
\]

In this way, the value of $C_\nu$ will be determined by

\[
C_\nu = \max_{t \in \{\tilde 1, \ldots, T_\nu\}} s_t^{\nu}(\rho - g_t).
\]

Let us examine our sufficient condition more closely. Using Lemma 5.1 we
have, for arbitrary $t$,

\[
\begin{aligned}
s_t^{\nu}(\rho - g_t) - s_{t+1}^{\nu}(\rho - g_{t+1})
&= (s_t^{\nu} - s_{t+1}^{\nu})(\rho - g_t) + s_{t+1}^{\nu}(g_{t+1} - g_t) \\
&\ge (s_t^{\nu} - s_{t+1}^{\nu})(\rho - g_t) + s_{t+1}^{\nu}\,\frac{\alpha_t (r_t - g_t)}{2 s_{t+1}} \\
&\ge (s_t^{\nu} - s_{t+1}^{\nu})(\rho - g_t) + s_{t+1}^{\nu}\,\frac{(s_{t+1} - s_t)(\rho - g_t)}{2 s_{t+1}} \\
&= (\rho - g_t)\Big[s_t^{\nu} - s_{t+1}^{\nu} + \frac12\, s_{t+1}^{\nu-1}(s_{t+1} - s_t)\Big].
\end{aligned}
\tag{9.11}
\]

Thus, it is sufficient to show that the bracketed term in (9.11) is positive for
all sufficiently large $t$.

From Lemma 5.2, we know that for an arbitrary choice of $\epsilon > 0$, there
exists an iteration $t_\epsilon$ such that for all $t \ge t_\epsilon$, we have
$\alpha_t/s_{t+1} \le \epsilon$. We will choose
$\epsilon = \epsilon_\nu := 1 - (2\nu)^{1/(1-\nu)}$, for reasons that will
become clear later. The corresponding iteration $t_{\epsilon_\nu}$ will be the
$T_\nu$ we are looking for. For $t \ge T_\nu$, we thus have

\[
s_t = s_{t+1} - \alpha_t = s_{t+1}(1 - \tau_t) \quad\text{for some } 0 \le \tau_t \le \epsilon_\nu.
\]

Using this to rewrite the bracketed terms of (9.11) yields

\[
s_t^{\nu} - s_{t+1}^{\nu} + \frac12\, s_{t+1}^{\nu-1}(s_{t+1} - s_t)
= s_{t+1}^{\nu}\Big[(1 - \tau_t)^{\nu} - 1 + \frac12 \tau_t\Big],
\]

so that the original claim will follow if we can prove that

\[
f(\tau) := (1 - \tau)^{\nu} - 1 + \frac12 \tau \ge 0 \quad\text{for } \tau \in [0, \epsilon_\nu].
\]

We have $f(0) = 0$, and also, $f'(\tau) = 1/2 - \nu(1 - \tau)^{\nu - 1}$.
Because $0 < \nu < 1/2$, $f'(\tau)$ is a decreasing function of $\tau$; by the
choice of $\epsilon_\nu$, $f'(\epsilon_\nu) = 0$, so that $f'(\tau) \ge 0$ for
$\tau \in [0, \epsilon_\nu]$. Hence $f(\tau)$ is an increasing function, which
is nonnegative for $\tau \in [0, \epsilon_\nu]$. We have finished the proof of
the lemma. □
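
The choice $\epsilon_\nu = 1 - (2\nu)^{1/(1-\nu)}$ can be checked numerically. The sketch below (helper names are illustrative, and $0 < \nu < 1/2$ is assumed) evaluates $f$ and $f'$ so that one can confirm $f'(\epsilon_\nu) = 0$ and $f \ge 0$ on $[0, \epsilon_\nu]$.

```python
def eps_nu(nu):
    """epsilon_nu = 1 - (2 nu)^(1 / (1 - nu)), the point where f' vanishes."""
    return 1.0 - (2.0 * nu) ** (1.0 / (1.0 - nu))


def f(tau, nu):
    """f(tau) = (1 - tau)^nu - 1 + tau / 2."""
    return (1.0 - tau) ** nu - 1.0 + 0.5 * tau


def f_prime(tau, nu):
    """f'(tau) = 1/2 - nu (1 - tau)^(nu - 1)."""
    return 0.5 - nu * (1.0 - tau) ** (nu - 1.0)
```

For example, with $\nu = 1/4$ one gets $\epsilon_\nu = 1 - 0.5^{4/3} \approx 0.603$, and $f$ stays nonnegative on the whole interval.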



Proof of Theorem 5.3. Most of the work has already been done in the proof
of Theorem 5.2. By (9.10), we have
$t - \tilde 1 \le (1 - \rho g_{\tilde 1})(\rho - g_t)^{-1}(s_t - s_{\tilde 1})$.
Combining this with Lemma 5.4 leads to

\[
t - \tilde 1 \le (1 - \rho g_{\tilde 1})\, C_\nu^{1/\nu}\, (\rho - g_t)^{-(1 + 1/\nu)}.
\]

For $\delta > 0$, we pick $\nu = \nu_\delta := 1/(2 + \delta) < 1/2$, and we can
rewrite the last inequality as

\[
(\rho - g_t)^{3+\delta} \le (1 - \rho g_{\tilde 1})\, C_{\nu_\delta}^{2+\delta}\, (t - \tilde 1)^{-1},
\]

or more concisely, $\rho - g_t \le C_\delta\,(t - \tilde 1)^{-1/(3+\delta)}$, where

\[
C_\delta = (1 - \rho g_{\tilde 1})^{1/(3+\delta)}\, C_{\nu_\delta}^{(2+\delta)/(3+\delta)}.
\]

It follows that $\rho - \mu(\lambda_t) \le \rho - g_t \le \epsilon$ whenever
$t - \tilde 1 > (C_\delta\, \epsilon^{-1})^{(3+\delta)}$, which completes the
proof of Theorem 5.3. □

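Proof of Theorem 5.4. We use the notation $g_t = g_t^{[2]}$,
$s_t = s_t^{[2]}$, and so forth, since we are using only Algorithm 2. Since
$r_t = \rho$ for all $t$, we automatically have

\[
s_{t+1} = s_t + \alpha_t = s_t + \frac12 \ln\Big(\frac{1+\rho}{1-\rho}\cdot\frac{1-g_t}{1+g_t}\Big),
\tag{9.12}
\]

and from (9.2),

\[
s_{t+1} g_{t+1} = s_t g_t + \frac12 \ln\Big(\frac{1+g_t}{1+\rho}\cdot\frac{1-g_t}{1-\rho}\Big).
\tag{9.13}
\]

We will simplify these equations a number of times. For this proof only, we
use the notation $x_t := \Delta G(\lambda_t^{[2]}) := \rho - g_t$ to rewrite the
quantities

\[
\frac{1+g_t}{1+\rho} = 1 - \frac{x_t}{1+\rho}
\quad\text{and}\quad
\frac{1-g_t}{1-\rho} = 1 + \frac{x_t}{1-\rho}.
\]

Using this notation, we update (9.12) and (9.13),

\[
s_{t+1} = s_t + \frac12 \ln\Big(1 + \frac{x_t}{1-\rho}\Big) - \frac12 \ln\Big(1 - \frac{x_t}{1+\rho}\Big),
\tag{9.14}
\]

\[
s_{t+1} g_{t+1} = s_t g_t + \frac12 \ln\Big(1 + \frac{x_t}{1-\rho}\Big) + \frac12 \ln\Big(1 - \frac{x_t}{1+\rho}\Big).
\tag{9.15}
\]

Let us simplify (9.15) further before proceeding. We subtract each side from
$s_{t+1}\rho$, using (9.14) to express $s_{t+1}$. This leads to

\[
\begin{aligned}
s_{t+1} x_{t+1} &= s_{t+1}\rho - s_{t+1} g_{t+1} \\
&= s_t x_t - \frac12 (1-\rho) \ln\Big(1 + \frac{x_t}{1-\rho}\Big)
   - \frac12 (1+\rho) \ln\Big(1 - \frac{x_t}{1+\rho}\Big).
\end{aligned}
\tag{9.16}
\]

Now we update (9.14). For $y \in [0, 2\rho]$, we define

\[
f_\rho(y) := \frac12 \ln\Big(1 + \frac{y}{1-\rho}\Big) - \frac12 \ln\Big(1 - \frac{y}{1+\rho}\Big)
 - \frac{y}{1-\rho^2} \le 0,
\]

where the inequality $f_\rho(y) \le 0$ holds since $f_\rho(0) = 0$ and
$f_\rho'(y) \le 0$ for $0 \le y \le 2\rho$. Since we consider the algorithm for
only $g_t > 0$, we have $x_t = \rho - g_t < \rho$, so that

\[
s_{t+1} = s_t + f_\rho(x_t) + \frac{x_t}{1-\rho^2}
\le s_t + \frac{x_t}{1-\rho^2} = s_t\Big(1 + \frac{x_t}{(1-\rho^2)\, s_t}\Big).
\tag{9.17}
\]

We now update (9.16) similarly. We define, for $y \in [0, 2\rho]$,

\[
\tilde f_\rho(y) := -\frac12 (1-\rho) \ln\Big(1 + \frac{y}{1-\rho}\Big)
 - \frac12 (1+\rho) \ln\Big(1 - \frac{y}{1+\rho}\Big)
 - \frac{y^2}{2(1-\rho^2)} + \frac{2\rho\, y^3}{3(1-\rho^2)^2} \ge 0,
\]

where the inequality $\tilde f_\rho(y) \ge 0$ holds since $\tilde f_\rho(0) = 0$
and since one can show $\tilde f_\rho'(y) \ge 0$ for $0 \le y \le 2\rho$. It thus
follows from $x_t \le \rho$ that

\[
\begin{aligned}
x_{t+1} s_{t+1}
&= x_t s_t + \tilde f_\rho(x_t) + \frac{x_t^2}{2(1-\rho^2)} - \frac{2\rho\, x_t^3}{3(1-\rho^2)^2} \\
&\ge x_t s_t\Big[1 + \frac{x_t}{2(1-\rho^2)\, s_t} - \frac{2\rho\, x_t^2}{3(1-\rho^2)^2\, s_t}\Big].
\end{aligned}
\tag{9.18}
\]

Suppose now that

\[
x_t \le C\, t^{-(1/3)-\delta}
\tag{9.19}
\]

for $t \ge t_0$, with $\delta > 0$. We can assume, without loss of generality,
that $\delta < 2/3$. By (9.17) we then have, for all $t \ge t_0$,

\[
s_t \le s_{t_0} + \frac{1}{1-\rho^2} \sum_{\ell=t_0}^{t-1} x_\ell
\le s_{t_0} + \frac{C}{1-\rho^2} \sum_{\ell=t_0}^{t-1} \ell^{-(1/3)-\delta}
\le s_{t_0} + \frac{C}{1-\rho^2}\,\frac{(t-1)^{(2/3)-\delta}}{2/3 - \delta}.
\]

It follows that we can define a finite $C'$ so that for all $t \ge t_0$,

\[
s_t \le C'\, t^{(2/3)-\delta}.
\tag{9.20}
\]

Consider now $z_t := x_t^{2-\delta} s_t$. By (9.19) and (9.20) we have, again
for $t \ge t_0$,

\[
z_t \le C^{2-\delta} C'\, t^{(2-\delta)(-(1/3)-\delta) + (2/3) - \delta}
= C''\, t^{(\delta/3) - 2\delta + \delta^2 - \delta}
\le C''\, t^{(\delta/3) - 2\delta + 2(\delta/3) - \delta} = C''\, t^{-2\delta},
\]

where we have used that $\delta^2 < 2(\delta/3)$ since $\delta < 2/3$. It
follows that

\[
\lim_{t\to\infty} z_t = 0.
\tag{9.21}
\]

On the other hand, by (9.17) and (9.18), we have

\[
\begin{aligned}
z_{t+1} = x_{t+1}^{2-\delta} s_{t+1} &= (x_{t+1} s_{t+1})^{2-\delta}\, s_{t+1}^{-1+\delta} \\
&\ge (x_t s_t)^{2-\delta}\Big[1 + \frac{x_t}{2(1-\rho^2)\, s_t} - \frac{2\rho\, x_t^2}{3(1-\rho^2)^2\, s_t}\Big]^{2-\delta}
 \times s_t^{-1+\delta}\Big[1 + \frac{x_t}{(1-\rho^2)\, s_t}\Big]^{-1+\delta}.
\end{aligned}
\]

For sufficiently large $t$, $x_t$ will be small so that
$x_t\,(2\rho/3(1-\rho^2)) \le \delta/4$. Thus,

\[
\begin{aligned}
z_{t+1} &\ge (x_t s_t)^{2-\delta}\Big[1 + \frac{x_t}{2(1-\rho^2)\, s_t}\Big(1 - \frac{\delta}{2}\Big)\Big]^{2-\delta}
 \times s_t^{-1+\delta}\Big(1 + \frac{x_t}{(1-\rho^2)\, s_t}\Big)^{-1+\delta} \\
&= z_t\Big[1 + \frac{x_t}{2(1-\rho^2)\, s_t}\Big(1 - \frac{\delta}{2}\Big)\Big]^{2-\delta}
 \Big[1 + \frac{x_t}{(1-\rho^2)\, s_t}\Big]^{-1+\delta}.
\end{aligned}
\tag{9.22}
\]

Now consider the function
$\phi_\delta(y) = [1 + \tfrac12 y (1 - \tfrac{\delta}{2})]^{2-\delta}(1 + y)^{-1+\delta}$.
Since $\phi_\delta(0) = 1$ and
$\phi_\delta'(y) = 4^{-2+\delta}[4 + y(2 - \delta)]^{1-\delta}(1 + y)^{-2+\delta}[2y - y\delta + \delta^2]$,
it follows that, for sufficiently small $y$,

\[
\phi_\delta(y) \ge 1 + \frac12 \phi_\delta'(0)\, y = 1 + \frac{\delta^2}{8}\, y.
\]

Since $x_t \to 0$, we have $\lim_{t\to\infty} x_t/s_t = 0$. It then follows from
(9.22) that

\[
z_{t+1} \ge z_t\Big(1 + \frac{\delta^2}{8}\,\frac{x_t}{(1-\rho^2)\, s_t}\Big)
\]

for sufficiently large $t$. This implies $z_{t+1} > z_t$ if $x_t > 0$, but we
always have $x_t > 0$ by (3.2). Consequently, there exists a threshold $t_1$ so
that $z_t$ is strictly increasing for $t \ge t_1$. Together with
$z_{t_1} = s_{t_1} x_{t_1}^{2-\delta} > 0$ (again because $x_{t_1}$ must be
nonzero), this contradicts (9.21). It follows that the assumption (9.19)
must be false, which completes the proof. □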
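
The closed form quoted for $\phi_\delta'$ is easy to mistype, so a quick numerical cross-check may be useful. The sketch below (function names are illustrative) implements $\phi_\delta$ and the stated derivative so that the latter can be compared with a central finite difference, and the linear lower bound $1 + (\delta^2/8)\,y$ can be tested for small $y$.

```python
def phi(y, delta):
    """phi_delta(y) = [1 + (y/2)(1 - delta/2)]^(2 - delta) * (1 + y)^(-1 + delta)."""
    return (1.0 + 0.5 * y * (1.0 - delta / 2.0)) ** (2.0 - delta) * (1.0 + y) ** (-1.0 + delta)


def phi_prime(y, delta):
    """Closed-form derivative as stated in the proof."""
    return (4.0 ** (-2.0 + delta)
            * (4.0 + y * (2.0 - delta)) ** (1.0 - delta)
            * (1.0 + y) ** (-2.0 + delta)
            * (2.0 * y - y * delta + delta ** 2))


def central_difference(func, y, delta, h=1e-6):
    """Numerical derivative of func(., delta) at y."""
    return (func(y + h, delta) - func(y - h, delta)) / (2.0 * h)
```

At $y = 0$ the closed form reduces to $\delta^2/4$, which is where the constant $\delta^2/8$ in the lower bound comes from.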

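Proof of Theorem 6.1. We drop the superscripts, since all variables
($\lambda_t$, $g_t$, $s_t$, $\mu_t$) will be for arc-gv. In order to prove the
convergence rate, we need to show that versions of Lemmas 5.1 and 5.3 hold for
arc-gv, starting with Lemma 5.1. We have, since tanh can be lower bounded as
before, and since for arc-gv we have
$\tanh(\gamma_t - \alpha_t^{\mathrm{arc}}) = \mu_t$,

\[
\int_{\gamma_t - \alpha_t^{\mathrm{arc}}}^{\gamma_t} \tanh u\, du
\ge \frac12 \alpha_t^{\mathrm{arc}}\,[\tanh\gamma_t + \tanh(\gamma_t - \alpha_t^{\mathrm{arc}})]
= \frac12 \alpha_t^{\mathrm{arc}}\,(r_t + \mu_t) > \frac12 \alpha_t^{\mathrm{arc}}\,(r_t + g_t).
\]

Using the recursive equation (4.2) with arc-gv's update and simplifying as
in the proof of Lemma 5.1 yields the analogous result

\[
g_{t+1} - g_t \ge \frac{\alpha_t^{\mathrm{arc}}\,(r_t - g_t)}{2 s_{t+1}}.
\]

Since the right-hand side is nonnegative, the sequence of $g_t$'s is nonnegative
and nondecreasing; arc-gv makes progress according to the smooth margin.
The proof of Lemma 5.3 follows from only the recursive equation (4.2) and
the nonnegativity of the $g_t$'s, so it also holds for arc-gv.

Now we adapt the proof of Theorem 5.2. Since we have just shown that
the statements of Lemmas 5.1 and 5.3 both hold for arc-gv, we can exactly
use the proof of Theorem 5.2 from the beginning through equation (9.9); we
must then specialize to arc-gv. We define $\Delta\mu(\lambda_t) := \rho - \mu_t$,

\[
\alpha_t^{\mathrm{arc}} \ge \tanh\alpha_t^{\mathrm{arc}}
= \tanh[\gamma_t - (\gamma_t - \alpha_t^{\mathrm{arc}})]
= \frac{\tanh\gamma_t - \tanh(\gamma_t - \alpha_t^{\mathrm{arc}})}{1 - \tanh\gamma_t\,\tanh(\gamma_t - \alpha_t^{\mathrm{arc}})}
= \frac{r_t - \mu_t}{1 - r_t \mu_t} \ge \rho - \mu_t = \Delta\mu(\lambda_t).
\]

Thus, we have

\[
s_{t+1} = s_{\tilde 1} + \sum_{\ell=\tilde 1}^{t} \alpha_\ell
\ge s_{\tilde 1} + \sum_{\ell=\tilde 1}^{t} \Delta\mu(\lambda_\ell)
\ge s_{\tilde 1} + (t - \tilde 1 + 1) \min_{\ell \in \{\tilde 1, \ldots, t\}} \Delta\mu(\lambda_\ell),
\]

or, changing the index and using
$\min_{\ell \in \{\tilde 1, \ldots, t-1\}} \Delta\mu(\lambda_\ell) \ge \min_{\ell \in \{\tilde 1, \ldots, t\}} \Delta\mu(\lambda_\ell)$,

\[
s_t \ge s_{\tilde 1} + (t - \tilde 1) \min_{\ell \in \{\tilde 1, \ldots, t\}} \Delta\mu(\lambda_\ell).
\]

Combining with (9.9), using
$\Delta G(\lambda_t) \ge \Delta\mu(\lambda_t) \ge \min_{\ell \in \{\tilde 1, \ldots, t\}} \Delta\mu(\lambda_\ell)$,

\[
t - \tilde 1 \le \frac{s_t}{\min_{\ell \in \{\tilde 1, \ldots, t\}} \Delta\mu(\lambda_\ell)}
\le \frac{(s_{\tilde 1} + \ln 2)\,[\Delta G(\lambda_{\tilde 1})]^{2/(1-\rho)}}
         {[\min_{\ell \in \{\tilde 1, \ldots, t\}} \Delta\mu(\lambda_\ell)]^{1 + 2/(1-\rho)}},
\]

which means that $\min_{\ell \in \{\tilde 1, \ldots, t\}} \Delta\mu(\lambda_\ell) \ge \epsilon$
is possible only if

\[
t \le \tilde 1 + (s_{\tilde 1} + \ln 2)\,\epsilon^{-(3-\rho)/(1-\rho)}.
\]

If $t$ exceeds this value,
$\min_{\ell \in \{\tilde 1, \ldots, t\}} \Delta\mu(\lambda_\ell) < \epsilon$.
This concludes the proof. □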

10. Proofs from Section 7. 

Proof of Theorem 7.2. We drop the superscripts during this proof. We need
to show that $g_{t+1} \ge g_t$ for all $t$ implies that $r_t \to r$ and that
$g_t \to \Upsilon(r)$. Using the argument of Theorem 7.1, an increase in $G$
means that $\Upsilon(r_t) = g_t + c_t$ where $c_t > 0$. Equivalently, by (4.2)
and the definition of $\Upsilon(r_t)$,

\[
s_{t+1} g_{t+1} = \Upsilon(r_t)\,\alpha_t + s_t g_t = (g_t + c_t)\,\alpha_t + s_t g_t = s_{t+1} g_t + c_t \alpha_t,
\]

and dividing by $s_{t+1}$ we have

\[
g_{t+1} = g_t + \frac{c_t \alpha_t}{s_{t+1}}.
\]

We need to show that $c_t \to 0$.
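
The recursion above is simple to reproduce numerically. The sketch below is an illustration under two assumptions not restated in this section: that $\Upsilon(r) = -\ln(1-r^2)/\ln\big((1+r)/(1-r)\big)$, and that AdaBoost's step is $\alpha_t = \tanh^{-1} r_t$. The function names are hypothetical.

```python
import math


def upsilon(r):
    """Assumed definition: Upsilon(r) = -ln(1 - r^2) / ln((1 + r) / (1 - r)), 0 < r < 1."""
    return -math.log(1.0 - r * r) / math.log((1.0 + r) / (1.0 - r))


def adaboost_smooth_margin_step(s, g, r):
    """One step of s_{t+1} g_{t+1} = s_t g_t + Upsilon(r_t) alpha_t with alpha_t = atanh(r_t).

    Returns (s_{t+1}, g_{t+1}).
    """
    alpha = math.atanh(r)
    s_next = s + alpha
    g_next = (s * g + upsilon(r) * alpha) / s_next
    return s_next, g_next
```

With $c_t = \Upsilon(r_t) - g_t$, the returned value satisfies $g_{t+1} = g_t + c_t\alpha_t/s_{t+1}$, so $G$ increases exactly when $\Upsilon(r_t)$ exceeds the current smooth margin.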
We are interested only in the later iterations, where AdaBoost is "close"
to the cycle. To ease notation, without loss of generality we will assume that
at $t = 1$, AdaBoost is already close to the cycle. More precisely, we assume
that for some $\epsilon_\alpha > 0$, for all integers $a \ge 0$, for all
$0 \le k < T$ (excluding $a = 0$, $k = 0$ since $t$ starts at 1),

\[
\alpha_{aT+k} \ge \alpha_{\mathrm{lowerbd},k},
\quad\text{where } \alpha_{\mathrm{lowerbd},k} := \Big(\lim_{\bar a\to\infty} \alpha_{\bar aT+k}\Big) - \epsilon_\alpha > 0.
\]

Also, for some $\epsilon_s > 0$, for all integers $a \ge 1$, for all
$0 \le k < T$, we assume

\[
s_{aT+k} \le a\, s_{\mathrm{upperbd}} + s_k,
\quad\text{where } s_{\mathrm{upperbd}} := \sum_{k=0}^{T-1} \Big(\lim_{\bar a\to\infty} \alpha_{\bar aT+k}\Big) + \epsilon_s.
\]

Since AdaBoost is converging to a cycle, we know that $r_t$ is not much
different from its limiting value, that is, that for any arbitrarily small
positive $\epsilon_\Upsilon$ there exists $T_{\epsilon_\Upsilon}$ such that
$t > T_{\epsilon_\Upsilon}$ implies

\[
\Big|\Upsilon(r_t) - \lim_{a\to\infty} \Upsilon(r_{t+aT})\Big| < \epsilon_\Upsilon.
\]

This implies $\Upsilon(r_{t-T}) \ge \Upsilon(r_t) - 2\epsilon_\Upsilon$ for
$t > T_{\epsilon_\Upsilon} + T$. This also implies
$\Upsilon(r_{t-2T}) \ge \Upsilon(r_t) - 2\epsilon_\Upsilon$ for
$t > T_{\epsilon_\Upsilon} + 2T$, and so on. Let us first choose an arbitrarily
small value for $\epsilon_\Upsilon$. Accordingly, find an iteration
$\bar t > T_{\epsilon_\Upsilon} + T$ so that $c_{\bar t} > 2\epsilon_\Upsilon > 0$.
(If $\bar t$ does not exist for any $\epsilon_\Upsilon$, the result is trivial
since we automatically have $c_t \to 0$, which we are trying to prove.)

First we will show that there is a strict increase in $G$ at the same point
in previous cycles. Since $G$ is nondecreasing by our assumption, we have
$g_{\bar t} \ge g_{\bar t - T}$. Thus
$\Upsilon(r_{\bar t}) = g_{\bar t} + c_{\bar t} \ge g_{\bar t - T} + c_{\bar t}$.
Hence,

\[
\Upsilon(r_{\bar t - T}) \ge \Upsilon(r_{\bar t}) - 2\epsilon_\Upsilon
= g_{\bar t} + c_{\bar t} - 2\epsilon_\Upsilon \ge g_{\bar t - T} + c_{\bar t} - 2\epsilon_\Upsilon.
\]

Thus, a strict increase occurred at time $\bar t - T$ as well, with
$c_{\bar t - T} \ge c_{\bar t} - 2\epsilon_\Upsilon > 0$. Let us repeat exactly
this argument for $\bar t - 2T$: since $G$ is nondecreasing,
$g_{\bar t} \ge g_{\bar t - 2T}$. Thus a strict increase in $G$ at $\bar t$
implies

\[
\Upsilon(r_{\bar t - 2T}) \ge \Upsilon(r_{\bar t}) - 2\epsilon_\Upsilon
= g_{\bar t} + c_{\bar t} - 2\epsilon_\Upsilon \ge g_{\bar t - 2T} + c_{\bar t} - 2\epsilon_\Upsilon.
\]

So a strict increase occurred at time $\bar t - 2T$ with
$c_{\bar t - 2T} \ge c_{\bar t} - 2\epsilon_\Upsilon > 0$. Continuing to repeat
this argument for past cycles shows that if $c_{\bar t} > 2\epsilon_\Upsilon > 0$,
then $c_{\bar t - T} > 0$, $c_{\bar t - 2T} > 0$, $c_{\bar t - 3T} > 0$, for
iterations at least as far back as $T_{\epsilon_\Upsilon}$. What we have shown
is that a strict increase in $G$ implies a strict increase in $G$ at the same
point in previous cycles. Let us show the theorem by contradiction. We make the
weakest possible assumption: for some large $t$, a strict increase in $G$
occurs (hence a strict increase occurs at the same point in a previous cycle).
These iterations where the increase occurs are assumed without loss of
generality to be $aT$, where $a \in \{1, 2, 3, \ldots\}$. (If
$T_{\epsilon_\Upsilon} > 1$, we simply renumber the iterations to ease
notation.) For all other iterations, $G$ is assumed only to be nondecreasing.
We need to show $\lim_{a\to\infty} c_{aT} = 0$. We now have for $a > 1$,

\[
g_{aT} \ge g_{(a-1)T+1} = g_{(a-1)T} + \frac{c_{(a-1)T}\,\alpha_{(a-1)T}}{s_{(a-1)T+1}}
\ge g_T + \sum_{\bar a=1}^{a-1} \frac{c_{\bar aT}\,\alpha_{\bar aT}}{s_{\bar aT+1}}.
\]

Putting this together with $s_{aT+k} \le a\, s_{\mathrm{upperbd}} + s_k$ and
$\alpha_{aT+k} \ge \alpha_{\mathrm{lowerbd},k}$, we find that

\[
g_{aT} \ge g_T + \sum_{\bar a=1}^{a-1} \frac{c_{\bar aT}\,\alpha_{\mathrm{lowerbd},0}}{\bar a\, s_{\mathrm{upperbd}} + s_1}.
\]

Since $s_{\mathrm{upperbd}}$ and $\alpha_{\mathrm{lowerbd},0}$ are constants,
the partial sums become arbitrarily large if no infinite subsequence of the
$c_{aT}$'s approaches zero. So, there exists a subsequence $1', 2', 3', \ldots$
such that $\lim_{a'\to\infty} c_{a'T} = 0$. Considering only this subsequence,
and taking the limits of both sides of the equation
$\Upsilon(r_{a'T}) = g_{a'T} + c_{a'T}$, we obtain

\[
\lim_{a'\to\infty} \Upsilon(r_{a'T}) = \lim_{a'\to\infty} g_{a'T}.
\tag{10.1}
\]

Since AdaBoost is assumed to be converging to a cycle and since
$1'T, 2'T, 3'T, \ldots$ is a subsequence of $T, 2T, 3T, \ldots$, then
$r := \lim_{a'\to\infty} r_{a'T}$ exists. Thus,

\[
\lim_{a'\to\infty} \Upsilon(r_{a'T}) = \Upsilon(r) = \lim_{a\to\infty} \Upsilon(r_{aT}).
\tag{10.2}
\]

Now, since the values of $G$ form a monotonically increasing sequence that is
bounded by 1,

\[
\lim_{a'\to\infty} g_{a'T} = \lim_{t\to\infty} g_t = \lim_{a\to\infty} g_{aT}.
\tag{10.3}
\]

Recall that by definition, $\Upsilon(r_{aT}) - g_{aT} = c_{aT}$. Taking the
limit of both sides as $a \to \infty$, and using (10.1), (10.2) and (10.3), we
find

\[
0 = \lim_{a\to\infty} [\Upsilon(r_{aT}) - g_{aT}] = \lim_{a\to\infty} c_{aT}.
\]

Thus, even if we make the weakest possible assumption, namely that there
is a strict increase even once per cycle, the increase goes to zero. In other
words, our initial assumption was that the $c_{aT}$'s are strictly positive
(not prohibiting other $c_t$'s from being positive as well), and we have shown
that their limit must be zero. So we cannot have strict increases at all,
$c_t \to 0$. Thus, we must have

\[
0 = \lim_{t\to\infty} c_t = \lim_{t\to\infty} [\Upsilon(r_t) - g_t],
\quad\text{so}\quad \lim_{t\to\infty} g_t = \lim_{t\to\infty} \Upsilon(r_t) = \Upsilon(r).
\]

This means all $r_t$'s in the cycle are identical, $r_t \to r$. We have
finished the proof. □






C. RUDIN, R. E. SCHAPIRE AND I. DAUBECHIES 



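Proof of Theorem 7.4. Again we drop superscripts. Choose $\delta > 0$
arbitrarily small. We shall prove that
$\limsup_t g_t \le \Upsilon(\bar\rho + \sigma) + \delta$ and
$\liminf_t g_t \ge \Upsilon(\bar\rho) - \delta$, which (since $\delta$ was
arbitrarily small) would prove the theorem. We start with the recursive
equation (4.2). Subtracting $\alpha_t g_t$ from both sides and simplifying
yields $s_{t+1}(g_{t+1} - g_t) = \Upsilon(r_t)\,\alpha_t - \alpha_t g_t$, and
dividing by $s_{t+1}$,

\[
g_{t+1} - g_t = (\Upsilon(r_t) - g_t)\,\frac{\alpha_t}{s_{t+1}}.
\tag{10.4}
\]

First we will show that, for some $\bar t$, if $g_{\bar t}$ is smaller than
$\Upsilon(\bar\rho) - \delta$, then $g_t$ must monotonically increase for
$t \ge \bar t$ until $g_t$ meets $\Upsilon(\bar\rho) - \delta$ after a finite
number of steps. Suppose $g_{\bar t}$ is smaller than
$\Upsilon(\bar\rho) - \delta$, and moreover suppose this is true for $N$
iterations: $\Upsilon(\bar\rho) - g_t > \delta > 0$, for
$t \in \{\bar t, \bar t + 1, \bar t + 2, \ldots, \bar t + N\}$. Then, since
$\Upsilon(r_t) \ge \Upsilon(\bar\rho)$, we have

\[
g_{t+1} - g_t = (\Upsilon(r_t) - g_t)\,\frac{\alpha_t}{s_{t+1}}
> \delta\,\frac{\alpha_t}{s_{t+1}}
\ge \delta\,\frac{\tanh^{-1}\bar\rho}{\tanh^{-1}(\bar\rho + \sigma)}\,\frac{1}{t+1} > 0,
\]

where we have used that $\alpha_t = \tanh^{-1} r_t \ge \tanh^{-1}\bar\rho$ and
$s_{t+1} \le (t+1)\tanh^{-1}(\bar\rho + \sigma)$, which are due to the
restrictions on $r_t$. Recursion yields

\[
\begin{aligned}
g_{\bar t + N} - g_{\bar t}
&\ge \delta\,\frac{\tanh^{-1}\bar\rho}{\tanh^{-1}(\bar\rho + \sigma)}
   \Big[\frac{1}{\bar t + 1} + \frac{1}{\bar t + 2} + \cdots + \frac{1}{\bar t + N}\Big] \\
&\ge \delta\,\frac{\tanh^{-1}\bar\rho}{\tanh^{-1}(\bar\rho + \sigma)} \int_{\bar t + 1}^{\bar t + N + 1} \frac{dx}{x}
 = \delta\,\frac{\tanh^{-1}\bar\rho}{\tanh^{-1}(\bar\rho + \sigma)}\,\ln\Big(1 + \frac{N}{\bar t + 1}\Big).
\end{aligned}
\]

Because $1 \ge g_{\bar t + N} - g_{\bar t}$, this implies

\[
N \le (\bar t + 1)\exp\Big[\frac{1}{\delta}\,\frac{\tanh^{-1}(\bar\rho + \sigma)}{\tanh^{-1}\bar\rho}\Big] =: N_{\bar t}.
\]

It follows that there must be at least one value $N$ in
$\{0, 1, 2, \ldots, N_{\bar t}, N_{\bar t} + 1\}$ such that
$\Upsilon(\bar\rho) - g_{\bar t + N} \le \delta$.

An identical argument can be made to show that if
$g_{\bar t} - \Upsilon(\bar\rho + \sigma) > \delta > 0$, then the values of
$g_t$, for $t \ge \bar t$, will monotonically decrease to meet
$\Upsilon(\bar\rho + \sigma) + \delta$. To make this explicit, suppose that
$g_t - \Upsilon(\bar\rho + \sigma) > \delta > 0$ for
$t \in \{\bar t, \bar t + 1, \ldots, \bar t + M\}$. Then, since
$-\Upsilon(r_t) \ge -\Upsilon(\bar\rho + \sigma)$,

\[
g_t - g_{t+1} = (g_t - \Upsilon(r_t))\,\frac{\alpha_t}{s_{t+1}}
> \delta\,\frac{\tanh^{-1}\bar\rho}{\tanh^{-1}(\bar\rho + \sigma)}\,\frac{1}{t+1}.
\]

By the same reasoning as above, it follows that $M$ cannot exceed some finite
$M_{\bar t}$. Therefore, we must have, for some
$t \in \{\bar t + 1, \ldots, \bar t + M_{\bar t}, \bar t + M_{\bar t} + 1\}$,
that $g_t - \Upsilon(\bar\rho + \sigma) \le \delta$, and that $g_t$ decreases
monotonically until this condition is met.

To summarize, we have just shown that the sequence of values of $g_t$ cannot
remain below $\Upsilon(\bar\rho) - \delta$, and cannot remain above
$\Upsilon(\bar\rho + \sigma) + \delta$. Next we show that from some $t_0$
onward, the $g_t$'s cannot even leave the interval
$[\Upsilon(\bar\rho) - \delta, \Upsilon(\bar\rho + \sigma) + \delta]$. First of
all, note that we can upper bound $|g_{t+1} - g_t|$, regardless of its sign, as
follows:

\[
|g_{t+1} - g_t| = |\Upsilon(r_t) - g_t|\,\frac{\alpha_t}{s_{t+1}}
\le \max(\Upsilon(\bar\rho + \sigma), 1)\,\frac{\tanh^{-1}(\bar\rho + \sigma)}{\tanh^{-1}\bar\rho}\,\frac{1}{t+1}
=: C_\sigma\,\frac{1}{t+1},
\]

where we have used
$|\Upsilon(r_t) - g_t| \le \max(\Upsilon(r_t), g_t) \le \max(\Upsilon(\bar\rho + \sigma), 1)$,
since $\Upsilon(r_t)$ and $g_t$ are both positive and bounded.

Now, if $t \ge C_\sigma\,[\Upsilon(\bar\rho + \sigma) - \Upsilon(\bar\rho) + \delta]^{-1} =: T_1$,
then the bound we just proved implies that the $g_t$ for $t \ge T_1$ cannot
jump from values below $\Upsilon(\bar\rho) - \delta$ to values above
$\Upsilon(\bar\rho + \sigma) + \delta$ in one time step. Since we know that the
$g_t$ cannot remain below $\Upsilon(\bar\rho) - \delta$ or above
$\Upsilon(\bar\rho + \sigma) + \delta$ for more than $\max(N_t, M_t)$
consecutive steps, it follows that for $t \ge T_1$, the $g_t$ must return to
$[\Upsilon(\bar\rho) - \delta, \Upsilon(\bar\rho + \sigma) + \delta]$ infinitely
often. Pick $t_0 \ge T_1$ so that
$g_{t_0} \in [\Upsilon(\bar\rho) - \delta, \Upsilon(\bar\rho + \sigma) + \delta]$.
We distinguish three cases: $g_{t_0} < \Upsilon(\bar\rho)$,
$\Upsilon(\bar\rho) \le g_{t_0} \le \Upsilon(\bar\rho + \sigma)$ and
$g_{t_0} > \Upsilon(\bar\rho + \sigma)$. In the first case, we know from (10.4)
that $g_{t_0+1} - g_{t_0} > 0$, so that

\[
g_{t_0} \le g_{t_0+1} \le g_{t_0} + C_\sigma\,\frac{1}{t_0 + 1}
\le \Upsilon(\bar\rho) + \Upsilon(\bar\rho + \sigma) - \Upsilon(\bar\rho) + \delta,
\]

that is,
$g_{t_0+1} \in [\Upsilon(\bar\rho) - \delta, \Upsilon(\bar\rho + \sigma) + \delta]$.
A similar argument applies to the third case. In the middle case, we find that

\[
\mathrm{dist}(g_{t_0+1}, [\Upsilon(\bar\rho), \Upsilon(\bar\rho + \sigma)])
:= \max(0,\; g_{t_0+1} - \Upsilon(\bar\rho + \sigma),\; \Upsilon(\bar\rho) - g_{t_0+1})
\le |g_{t_0+1} - g_{t_0}| \le \frac{C_\sigma}{t_0 + 1},
\]

which does not exceed $\delta$ if $t_0 \ge C_\sigma\,\delta^{-1} =: T_2$. It
follows that if $t_0 \ge T_0 := \max(T_1, T_2)$, and
$g_{t_0} \in [\Upsilon(\bar\rho) - \delta, \Upsilon(\bar\rho + \sigma) + \delta]$,
then $g_{t_0+1}$ will likewise be in
$[\Upsilon(\bar\rho) - \delta, \Upsilon(\bar\rho + \sigma) + \delta]$. By
induction we obtain that
$g_t \in [\Upsilon(\bar\rho) - \delta, \Upsilon(\bar\rho + \sigma) + \delta]$
for all $t \ge t_0$. This implies

\[
\liminf_{t\to\infty} g_t \ge \Upsilon(\bar\rho) - \delta
\quad\text{and}\quad
\limsup_{t\to\infty} g_t \le \Upsilon(\bar\rho + \sigma) + \delta.
\]

Since, at the start of this proof, $\delta > 0$ could be chosen arbitrarily
small, we obtain $\liminf_{t\to\infty} g_t \ge \Upsilon(\bar\rho)$ and
$\limsup_{t\to\infty} g_t \le \Upsilon(\bar\rho + \sigma)$.

Note that we do not really need uniform bounds on $r_t$ for this proof to
work. In fact, we need only bounds that hold "eventually," so it is sufficient
that $\limsup_t r_t \le \bar\rho + \sigma$, $\liminf_t r_t \ge \bar\rho$. In the
special case where $\lim_t r_t = \rho$, that is, where $\sigma = 0$ and
$\bar\rho = \rho$, it then follows that $\lim_t g_t = \Upsilon(\rho)$. Hence we
have completed the proof. □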
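
The statement just proved can be illustrated with a small simulation of recursion (10.4). The sketch below is not the paper's experiment: it assumes $\Upsilon(r) = -\ln(1-r^2)/\ln\big((1+r)/(1-r)\big)$, AdaBoost's step $\alpha_t = \tanh^{-1} r_t$, the initialization $s_1 = 0$ with $s_1 g_1 = -\ln F(0) = -\ln m$, and edges drawn uniformly at random from $[\bar\rho, \bar\rho + \sigma]$. All names are hypothetical.

```python
import math
import random


def upsilon(r):
    """Assumed definition: Upsilon(r) = -ln(1 - r^2) / ln((1 + r) / (1 - r))."""
    return -math.log(1.0 - r * r) / math.log((1.0 + r) / (1.0 - r))


def simulate_bounded_edges(rho_bar, sigma, m, steps, seed=0):
    """Iterate s_{t+1} g_{t+1} = s_t g_t + Upsilon(r_t) alpha_t with random bounded edges.

    Returns the final smooth margin g after `steps` iterations.
    """
    rng = random.Random(seed)
    s = 0.0
    sg = -math.log(m)  # s_1 g_1 = -ln F(0) = -ln m, since F(0) sums m ones
    for _ in range(steps):
        r = rng.uniform(rho_bar, rho_bar + sigma)
        alpha = math.atanh(r)
        sg += upsilon(r) * alpha
        s += alpha
    return sg / s
```

Since $s_t g_t$ is then $-\ln m$ plus an $\alpha$-weighted sum of $\Upsilon(r_t)$ values, the final $g$ should settle between $\Upsilon(\bar\rho)$ and $\Upsilon(\bar\rho + \sigma)$ once the $-\ln m$ offset has been averaged out.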

Proof of Theorem 7.5. For any given p and a, we will create a 
matrix M such that edge values can always be chosen within [p,p + cr]. For 
this matrix M, we must also have p> p. Choose a value for p, and choose 
a arbitrarily small. Also, for reasons that will become clear later, choose a 
constant <j> such that 

<t>> --, = , 

1 — p — a 

and choose m > 2<j)/a. As usual, m will be the number of training examples. 
Let M contain only the set of possible columns that have at most m(p-\- 1)/2 
entries that are +1. (We can assume m was chosen so that this is an integer.) 
This completes our construction of M. 

Before we continue, we need to prove that for p of this matrix M, we have 
p < p. For any column j , 

M tj < (+1) KH 2 ' + (-1) (m - Kl 2 J J = mp. 

Thus, for any A € A n , we upper bound the average margin (i.e., the average 
margin over training examples), 

- E E A i A % = E A 4 - E Ma )< E ^ -mp=^^i=^ 

i=lj=l j \ i=l / j j 

We have just shown that the average margin is at most p. There must be 
at least one training example that achieves a margin at or below the average 
margin; thus minj(MA)j < p, and since A is arbitrary, p = max^ eA minj(MA)j < 
p, the maximum margin is at most p. 

We will now describe our procedure for choosing weak classifiers, and then 
prove that this procedure always chooses edge values rt within [p, p + a]. As 
usual, for t = 1 we set = l/m for all i. Let us describe the procedure to 
choose our weak classifier jt, for iteration t. Without loss of generality, we 
reorder the training examples so that dt \ > d± 2 > • ■ • > dt t m, for convenience 
of notation in describing the procedure. We choose a weak classifier jt that 
correctly classifies the first i training examples, where i is the smallest index 
such that 2(5Df=i ~ 1 P- That is, we correctly classify enough exam- 
ples so that the edge just exceeds p. The maximum number of correctly 
classified examples, i, will be at most m(p + l)/2, corresponding to the case 
where dt t \ = • • • = dt, m — Thus, the weak classifier we choose thankfully 
corresponds to a column of M. The edge r t is r t = 2(£)i=i dt,i) — 1 > P- We 
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can now update AdaBoost's weight vector using the usual exponential rule. 
Thus, our description of the procedure is complete. 

By definition, we have chosen the edge such that p <r t . We have only to 
show that rt < p + a for each t. The main step in our proof is to show that 
<f> = K\ = Kt for all t, where for each iteration t, 

K t := max< max — - — 

L »i,»2 dt,i 2 

We will prove this by induction. For the base case t = 1, K\ = max{l, <j>] = eft. 
Now for the inductive step. In order to make calculations easier, we will 
write AdaBoost's weight update in a different way (this iterated map can 
be derived from the usual exponential update) [22, 23]. Namely, 

dtA 



d 



for i < i, 

l + n 

t+hi ~\ d ti 

— , for i > i. 

l-n 



Assuming (f> = Kt, we will show that Kt+\ = Kt- We can calculate the value 
of Kt+i using the update rule written above, 

r , f dt+iM ,1 fmaxjjdt+ij 

Kt+i = max< max 



max 



max 



I h,h dt+i,i 2 J I min i2 d t 
max{rft,i/(l +r t ),d t - i+1 /(l - r t )} ^ 
mm{d t -i/(l + rt),dt, m /(l-rt)} 
d t ,i d tt j + i d t A l~r t d t ~ i+ i l + r t 
d t j ' dt.rn ' dt, m 1 + n ' d t j 1 - r t ' 



By our inductive assumption, the ratios of $d_{t,i}$ values are all nicely bounded,
that is, $d_{t,1}/d_{t,\bar{i}} \le K_t = \phi$, $d_{t,\bar{i}+1}/d_{t,m} \le \phi$ and $d_{t,1}/d_{t,m} \le \phi$. Another
bound we have automatically is $(1 - r_t)/(1 + r_t) \le 1$. We have now shown that none of the first
three terms can be greater than $\phi$, thus they can be ignored. Consider just
the fourth term. Since we have ordered the training examples, $d_{t,\bar{i}+1}/d_{t,\bar{i}} \le 1$. If
we can bound $(1 + r_t)/(1 - r_t)$ by $\phi$, we will be done with the induction. We
can bound the edge $r_t$ from above, using our choice of $\bar{i}$. Namely, we chose
$\bar{i}$ so that the edge exceeds $\rho$ by the influence of at most one extra training
example,

(10.5)    $r_t \le \rho + 2 \max_i d_{t,i} = \rho + 2 d_{t,1}.$
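To spell out the step behind (10.5): the short derivation below is a sketch that assumes, as the preceding sentence indicates, that $\bar{i}$ is the smallest index at which the running edge $2\sum_{i \le \bar{i}} d_{t,i} - 1$ reaches $\rho$, and that the examples are ordered so that $d_{t,1}$ is the largest weight.

```latex
% Minimality of \bar{i}: stopping one example earlier falls short of \rho.
2\sum_{i=1}^{\bar{i}-1} d_{t,i} - 1 < \rho.

% Adding back the single example \bar{i} therefore overshoots \rho
% by at most twice its weight, which is at most twice the largest weight:
r_t = 2\sum_{i=1}^{\bar{i}} d_{t,i} - 1
    = \Bigl( 2\sum_{i=1}^{\bar{i}-1} d_{t,i} - 1 \Bigr) + 2 d_{t,\bar{i}}
    < \rho + 2 d_{t,\bar{i}}
    \le \rho + 2 d_{t,1}.
```

The inequality obtained this way is strict, so (10.5) holds in particular.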

Let us now upper bound $d_{t,1}$. By definition of $K_t$, we have $d_{t,1}/d_{t,m} \le K_t$, and
thus $d_{t,1} \le K_t d_{t,m} \le K_t/m$. Here, we have used that $d_{t,m} = \min_i d_{t,i} \le 1/m$
since the $d_t$ vectors are normalized to 1. By our specification that $m \ge 2\phi/\sigma$
and by our induction principle, we have $d_{t,1} \le K_t/m \le \phi\sigma/(2\phi) = \sigma/2$. Using
(10.5), $r_t \le \rho + 2\sigma/2 = \rho + \sigma$. (This is by design.) So,

$$\frac{1 + r_t}{1 - r_t} \le \frac{1 + \rho + \sigma}{1 - \rho - \sigma} = \phi.$$

Thus, $K_{t+1} = \phi$. We have just shown that for this procedure, $K_t = \phi$ for all
t.

Lastly, we note that since $K_t = \phi$ for all t, we will always have $r_t \le \rho + \sigma$,
by the upper bound for $r_t$ we have just calculated. □
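The procedure in this proof is easy to simulate, which gives a numerical sanity check of the two invariants (edges in $[\rho, \rho + \sigma]$ and weight ratios bounded by $\phi$). The sketch below is illustrative only and rests on several assumptions that are not taken verbatim from the text: the function name `simulate_bounded_edges` is hypothetical, $\phi$ is taken to be $(1 + \rho + \sigma)/(1 - \rho - \sigma)$, the initial weights are uniform (consistent with the base case $K_1 = \max\{1, \phi\}$), and an explicit renormalization is added after each update purely to counter floating-point drift.

```python
import math


def simulate_bounded_edges(rho, sigma, num_iters=200):
    """Simulate the weak-classifier selection procedure of the proof.

    Returns (edges, max_ratio, phi, m), where edges is the list of r_t values
    and max_ratio is the largest max(d)/min(d) seen after any update.
    """
    if not (0 < rho and sigma > 0 and rho + sigma < 1):
        raise ValueError("need 0 < rho, sigma > 0 and rho + sigma < 1")

    # Assumed definition of phi (the value the final inequality compares to).
    phi = (1 + rho + sigma) / (1 - rho - sigma)
    # Number of training examples, chosen so that m >= 2 * phi / sigma.
    m = math.ceil(2 * phi / sigma)
    # Assumed starting point: uniform weights, so the initial ratio is 1.
    d = [1.0 / m] * m

    edges = []
    max_ratio = 1.0
    for _ in range(num_iters):
        # Order the examples by decreasing weight.
        d.sort(reverse=True)

        # Find the smallest prefix length i_bar whose edge reaches rho.
        prefix = 0.0
        for i_bar, weight in enumerate(d, start=1):
            prefix += weight
            if 2 * prefix - 1 >= rho:
                break
        r = 2 * prefix - 1
        edges.append(r)

        # Iterated-map update: the first i_bar examples (correctly
        # classified) are divided by 1 + r, the remaining ones by 1 - r.
        d = [w / (1 + r) if i < i_bar else w / (1 - r)
             for i, w in enumerate(d)]

        # In exact arithmetic the weights already sum to 1; renormalizing
        # only prevents rounding errors from accumulating over iterations.
        total = sum(d)
        d = [w / total for w in d]

        max_ratio = max(max_ratio, max(d) / min(d))

    return edges, max_ratio, phi, m
```

For instance, calling `simulate_bounded_edges(0.2, 0.1)` uses 38 training examples, and one can then check that every returned edge lies between 0.2 and 0.3 and that the returned maximum ratio does not exceed `phi`, up to rounding error.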

11. Conclusions. Our broad goal is to understand the generalization 
properties of boosting algorithms such as AdaBoost. This is a large and 
difficult problem that has been studied for a decade. Yet, how are we to 
understand generalization when even the most basic convergence properties 
of the most commonly used boosting algorithm are not well understood? 
AdaBoost's convergence properties are understood in precisely two cases,
namely the cyclic case, and the case of bounded edges introduced here. 

Our work consists of two main contributions, both of which use the smooth 
margin function as an important tool. First, from the smooth margin itself,
we derive and analyze the algorithms coordinate ascent boosting and
approximate coordinate ascent boosting. These algorithms are similar to
AdaBoost in that they are adaptive and based on coordinate ascent. However,
their convergence can be understood, namely, both algorithms converge
to a maximum margin solution with a fast convergence rate. We also give 
an analogous convergence rate for Breiman's arc-gv algorithm. Our second 
contribution is an analysis of AdaBoost in terms of the smooth margin. We 
analyze the case where AdaBoost exhibits cyclic behavior, and we present 
the case of bounded edges. In the case of bounded edges, we are able to 
derive a direct relationship between AdaBoost's edge values (which measure 
the performance of the weak learning algorithm) and the asymptotic margin. 

11.1. Open problems. We leave open a long list of relevant problems. 
We have made much progress in understanding AdaBoost's convergence in 
general via the understanding of special cases, such as the cyclic setting 
and the setting with bounded edges. The next interesting questions are even
more general: for a given matrix M, can we predict whether optimal-case
AdaBoost will converge to a maximum margin solution? Also, is there a
procedure for choosing weak classifiers in the nonoptimal case that would 
always force convergence to a maximum margin solution? In this case, one 
would have to plan ahead in order to attain large edge values. 

Another open area involves numerical experiments. Our new algorithms
fall "in between" AdaBoost and arc-gv in many ways; for example, their
step sizes lie between those of arc-gv and AdaBoost. Can
we determine which problem domains match with which algorithms? From
our experiments, we suspect the answer to this is quite subtle, and in many 
domains, all of these algorithms may be tied (within some error precision). 

We have presented a controlled numerical experiment using only AdaBoost,
to show that the weak learning algorithm (and thus the margin) may
have a large impact on generalization. Other experiments along the same 
lines can be suggested; for example, if the weak learning algorithm is simply
bounded from above (cannot choose an edge above $c$, where $0 \ll c < 1$),
does this restriction limit the generalization ability of the algorithm? From
our convergence analysis, it appears that this sort of limitation might
simplify convergence calculations, considering that a significant portion of
our convergence calculations are step-size bounds.

Acknowledgments. Thanks to Manfred Warmuth, Gunnar Rätsch and
our anonymous reviewers for helpful comments. 